New to R, so I am sorry if this seems like a stupid question.
So basically I have a dataframe with 100 rows and 3 columns of data. I also have a vector with 3 thresholds, one for each column. I was wondering how you would filter out the values of each column that are greater than the corresponding threshold.
Edit: Sorry for the incomplete question.
So essentially what I would like to create is a function (that takes a dataframe and a vector of thresholds as parameters) that applies every threshold to its respective column of the dataframe (so there is one threshold for every column). The number of elements in each column that "respect" their threshold should then be collected in a vector. For example:
Column 1: values = 1, 2, 3. Threshold = (only values lower than 3)
Column 2: values = 4, 5, 6. Threshold = (only values lower than 6)
Output: a vector (2, 2), since there are two elements in each column that are under their respective thresholds.
Thank you everyone for the help!!
Your example data:
df <- data.frame(a = 1:3, b = 4:6)
threshold <- c(3, 6)
One option to resolve your question is to use sapply(), which applies a function over a list or vector. In this case, I create a vector for the columns in df with 1:ncol(df). Inside the function, you can count the number of values less than a given threshold by summing the number of TRUE cases:
col_num <- 1:ncol(df)
sapply(col_num, function(x) {sum(df[, x] < threshold[x])})
Or, in a single line:
sapply(1:ncol(df), function(x) {sum(df[, x] < threshold[x])})
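If you want this wrapped into the function described in the question, a minimal sketch could look like the following (the name count_below is my own, not an established function):
# Hypothetical wrapper: for each column, count how many values fall
# below the matching threshold and return the counts as a vector.
count_below <- function(df, thresholds) {
  sapply(seq_along(df), function(x) sum(df[[x]] < thresholds[x]))
}
count_below(df, threshold)
# [1] 2 2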
I have a dataframe with 14 columns, and I want to subset it to a dataframe with the same columns but keeping only the rows whose ID repeats (for example, if ID = 2 appears more than once, I keep those rows).
To begin, I applied table() to my dataframe to see the frequencies of each ID:
head(sort(table(call.dat$IMSI), decreasing = TRUE), 100)
In my case, 20801170106338 appears twice, so I want to see the two observations for this ID.
Afterward, I tried x <- subset(call.dat, IMSI == "20801170106338") and hsb6 <- call.dat[call.dat$IMSI == "20801170106338", ], but the results are wrong (x returns 0 observations of 14 variables, and hsb6 contains only NA rows).
Can you help me? Thanks.
PS: IMSI is a numeric value.
And x <- subset(call.dat, Handset.Manufacturer == "LG") is another example which works perfectly...
You can use duplicated(), which returns a logical vector that is TRUE for every record that repeats an earlier one. Since you also want to keep the first occurrence of each repeated IMSI, combine it with a second pass from the end:
isDuplicated <- duplicated(call.dat$IMSI) | duplicated(call.dat$IMSI, fromLast = TRUE)
Then, you can extract all the rows containing a duplicated value.
call.dat.duplicated <- call.dat[isDuplicated, ]
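As a quick sanity check on made-up data (the values below are invented, not real IMSIs):
# Toy data: the value 2 appears twice, so both of its rows should be kept.
toy <- data.frame(IMSI = c(1, 2, 2, 3))
keep <- duplicated(toy$IMSI) | duplicated(toy$IMSI, fromLast = TRUE)
toy[keep, ]  # rows 2 and 3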
I have a large matrix from which I would like to randomly extract a smaller matrix. (I want to do this 1000 times, so ultimately it will be in a for loop.) Say for example that I have this 9x9 matrix:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
From this matrix, I would like a random 3x3 subset. The trick is that I do not want any of the row or column sums in the final matrix to be 0. Another important thing is that I need to know the original number of the rows and columns in the final matrix. So, if I end up randomly selecting rows 4, 5, and 7 and columns 1, 3, and 8, I want to have those identifiers easily accessible in the final matrix.
Here is what I've done so far.
First, I create a vector of row numbers and column numbers. I am trying to keep these attached to the matrix throughout.
r.num<-seq(from=1,to=nrow(mat),by=1) #vector of row numbers
c.num<-seq(from=0,to=ncol(mat),by=1) #vector of col numbers (adj for r.num)
mat.1<-cbind(r.num,mat)
mat.2<-rbind(c.num,mat.1)
Now I have a 10x10 matrix with identifiers. I can select my rows by creating a random vector and subsetting the matrix.
rand <- sample(r.num,3)
temp1 <- rbind(mat.2[1,],mat.2[rand,]) #keep the identifier row
This works well! Now I want to randomly select 3 columns. This is where I am running into trouble. I tried doing it the same way.
rand2 <- sample(c.num,3)
temp2 <- cbind(temp1[,1],temp1[,rand2])
The problem is that I end up with some row and column sums that are 0. I can eliminate columns that sum to 0 first.
temp3 <- temp1[,which(colSums(temp1[2:nrow(temp1),])>0)]
cols <- which(colSums(temp1[2:nrow(temp1),2:ncol(temp1)])>0)
rand3 <- sample(cols,3)
temp4 <- cbind(temp3[,1],temp3[,rand3])
But I end up with an error message. For some reason, R does not like to subset the matrix this way.
So my question is, is there a better way to subset the matrix by the random vector "rand3" after the zero columns have been removed OR is there a better way to randomly select three complementary rows and columns such that there are none that sum to 0?
Thank you so much for your help!
If I understood your problem, I think this would work:
mat=matrix(c(0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,
0,0,0,0,1,1,1,0,0,1,0,1,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,0,1,1,1,0,0,
1,0,1,0,0,0,0,0,1,0,1,0,0,0,1), nrow=9)
smallmatrix = matrix(0, nrow = 3, ncol = 3)
while(any(apply(smallmatrix, 2, sum) == 0) | any(apply(smallmatrix, 1, sum) == 0)) {
  cols = sample(ncol(mat), 3)
  rows = sample(nrow(mat), 3)
  smallmatrix = mat[rows, cols]
}
colnames(smallmatrix) = cols
rownames(smallmatrix) = rows
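Since you mention wanting to do this 1000 times, one way (just a sketch; the function name sample_submatrix is my own) is to wrap the rejection loop in a function and call it with replicate():
# Hypothetical helper: keeps drawing random rows/columns until no row or
# column sum of the submatrix is zero, then stores the original row and
# column numbers as dimnames.
sample_submatrix <- function(mat, size = 3) {
  repeat {
    rows <- sample(nrow(mat), size)
    cols <- sample(ncol(mat), size)
    smallmatrix <- mat[rows, cols]
    if (all(rowSums(smallmatrix) != 0) && all(colSums(smallmatrix) != 0)) break
  }
  rownames(smallmatrix) <- rows
  colnames(smallmatrix) <- cols
  smallmatrix
}
# A list of 1000 submatrices, each carrying its row/column identifiers.
results <- replicate(1000, sample_submatrix(mat), simplify = FALSE)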
I'm stuck with a simple loop that takes more than an hour to run, and need help to speed it up.
Basically, I have a matrix with 31 columns and 400 000 rows. The first 30 columns have values, and the 31st column has a column-number. I need to, per row, retrieve the value in the column indicated by the 31st column.
Example row: [26,354,72,5987..,461,3] (this means the sought-after value is the one in column 3, i.e. 72)
The too slow loop looks like this:
a <- rep(0, nrow(data))  # To pre-allocate memory
for (i in 1:nrow(data)) {
  a[i] <- data[i, data[i, 31]]
}
I would think this would work:
a <- data[,data[,31]]
... but it results in "Error: cannot allocate vector of size 2.8 Mb".
I fear that this is a really simple question, so I've spent hours trying to understand apply, lapply, reshape, and more, but somehow I can't get a grip on the vectorization concept in R.
The matrix actually has even more columns that also go into the a-parameter, which is why I don't want to rebuild the matrix, or split it.
Your support is highly appreciated!
Chris
t(data[,1:30])[30*(0:399999)+data[,31]]
This works because you can index matrices both in array format and in vector format (a 30 * 400000-long vector in this case), counting column-wise first. To count row-wise, you use the transpose.
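As a quick check that this matches the loop, here is a small mock example (5 rows, 3 value columns plus an index column; the names dat, a_loop and a_vec are made up):
# Mock data: 3 value columns plus a 4th column holding a column number.
dat <- cbind(matrix(1:15, nrow = 5), sample(1:3, 5, replace = TRUE))
# Loop version, as in the question (adapted to 4 columns).
a_loop <- sapply(1:nrow(dat), function(i) dat[i, dat[i, 4]])
# Vectorised version: transpose, then index the underlying vector.
a_vec <- t(dat[, 1:3])[3 * (0:(nrow(dat) - 1)) + dat[, 4]]
all(a_loop == a_vec)  # should be TRUE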
Single-index notation for the matrix may use less memory. This would involve doing something like:
i <- nrow(data)*(data[,31]-1) + 1:nrow(data)
a <- data[i]
Below is an example of single-index notation for matrices in R. In this example, the index of the per-row maximum is appended as the last column of a random matrix. This last column is then used to select the per-row maxima via single-index notation.
## create a random (10 x 5) matrix
M <- matrix(rpois(50,50),10,5)
## use the last column to index the maximum value of the first 5
## columns
MM <- cbind(M,apply(M,1,which.max))
## linear index: nrow(MM) * (column ID - 1) + row ID
i <- nrow(MM)*(MM[,ncol(MM)]-1) + 1:nrow(MM)
all(MM[i] == apply(M,1,max))
Using an index matrix is an alternative that will probably use more memory but is slightly clearer:
ii <- cbind(1:nrow(MM),MM[,ncol(MM)])
all(MM[ii] == apply(M,1,max))
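Applied to the original problem (assuming the matrix is called data and the column number sits in column 31, as described in the question), the index-matrix version would be:
a <- data[cbind(1:nrow(data), data[, 31])]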
Try to change the code to work a column at a time:
M <- matrix(rpois(30*400000,50),400000,30)
MM <- cbind(M,apply(M,1,which.max))
a <- rep(0,nrow(MM))
for (i in 1:(ncol(MM)-1)) {
  a[MM[, ncol(MM)] == i] <- MM[MM[, ncol(MM)] == i, i]
}
This sets all elements of a to the values from column i wherever the last column equals i. Building the matrix took longer than computing the vector a.