R: replace columns of matrix if vector fulfills certain condition [duplicate] - r

I have a dataframe as follows
Me You They Him She
1 4 6 3 233
82 0 2 4 122
98 2 5 2 99
I want to get a new dataframe which only contains those columns where the colMeans are >30 so the result should look like
Me She
1 233
82 122
98 99
I tried something like
dfNew<-subset(df,colMeans(df[, 1:ncol(df)]>30))
but got the error
Error in subset.data.frame(df[, 1:ncol(df)]> :
'subset' must be logical
Clearly I don't know what I'm doing.

try this:
df[,colMeans(df)>30]

I think this is what you are looking for.
This step just recreates your data.
Me <- c(1,82,98)
You <- c(4,0,2)
They <- c(6,2,5)
Him <- c(3,4,2)
She <- c(233,122,99)
df <- data.frame(Me, You, They, Him, She)
This is what you want.
df[, sapply(df, mean) > 30]
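A minimal sketch of why the original attempt failed and how the column filter can be written defensively. In `subset(df, colMeans(df[, 1:ncol(df)] > 30))` the `> 30` sits inside `colMeans()`, so the call averages a logical matrix and returns numbers, and the second argument of `subset()` selects rows rather than columns. The name `keep` below is only illustrative.

```r
# Data copied from the question
df <- data.frame(Me   = c(1, 82, 98),
                 You  = c(4, 0, 2),
                 They = c(6, 2, 5),
                 Him  = c(3, 4, 2),
                 She  = c(233, 122, 99))

# Logical vector with one entry per column: TRUE where the column mean exceeds 30
keep <- colMeans(df) > 30

# drop = FALSE keeps the result a data frame even if only one column qualifies
dfNew <- df[, keep, drop = FALSE]
print(dfNew)
```

Without `drop = FALSE`, a result with a single qualifying column would be returned as a plain vector instead of a data frame.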

Related

R countif and sum on multiple columns matching elements in specified vector

I am applying this function to my dataset column DL1 on another vector as below and receiving the results expected
table(df$DL1[df$DL1 %in% undefined_dl_codes])
Result:
0 10 30 3B 4 49 54 5A 60 7 78 8 90
24 366 4 3 665 40 1 1 14 8 4 87 1
However, I also have columns DL2, DL3 and DL4, which hold the same kind of data. How can I apply the function to all four columns and get a single summary result?
Any help highly appreciated!
This may not be the best method, but you could do the following:
table(c(df$DL1[df$DL1 %in% undefined_dl_codes],
df$DL2[df$DL2 %in% undefined_dl_codes],
df$DL3[df$DL3 %in% undefined_dl_codes],
df$DL4[df$DL4 %in% undefined_dl_codes]
)
)
Using Raghuveer's solution, I simplified it further:
attach(df)
table(c(DL1,DL2,DL3,DL4)[c(DL1,DL2,DL3,DL4) %in% undefined_dl_codes])
detach(df)
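As an alternative that avoids `attach()`, the four columns can be flattened into one vector with `unlist()` before filtering. This is only a sketch: the data and the contents of `undefined_dl_codes` below are made up for illustration, and `cols` and `all_codes` are hypothetical names.

```r
# Made-up example data; the real df and undefined_dl_codes come from the question
df <- data.frame(DL1 = c("10", "4", "99"),
                 DL2 = c("4", "4", "8"),
                 DL3 = c("7", "10", "55"),
                 DL4 = c("8", "12", "10"),
                 stringsAsFactors = FALSE)
undefined_dl_codes <- c("10", "4", "7", "8")

# Columns to summarise together
cols <- c("DL1", "DL2", "DL3", "DL4")

# Flatten the selected columns into a single character vector
all_codes <- unlist(df[cols], use.names = FALSE)

# Count only the codes that appear in undefined_dl_codes
counts <- table(all_codes[all_codes %in% undefined_dl_codes])
print(counts)
```

Adding or removing a column then only requires changing `cols`.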

R lapply(): Change all columns within all data frames in a list to numeric, then convert all values to percentages

Question:
I am a little stumped as to how I can batch process as.numeric() (or any other function for that matter) for columns in a list of data frames.
I understand that I can view specific data frames or columns within this list by using:
> my.list[[1]]
# or columns within this data frame using:
> my.list[[1]][1]
But my trouble comes when I try to apply this into an lapply() function to change all of the data from integer to numeric.
# Example of what I am trying to do
> my.list[[each data frame in list]][each column in data frame] <-
as.numeric(my.list[[each data frame in list]][each column in data frame])
If you can assist me in any way, or know of any resources that can help me out I would appreciate it.
Background:
My data frames are structured like the example below, where I have 5 habitat types and, for each individual of a species, the area of its home range that falls in each habitat:
# Example data
spp.1.data <- data.frame(Habitat.A = c(100,45,0,9,0), Habitat.B = c(0,0,203,45,89), Habitat.C = c(80,22,8,9,20), Habitat.D = c(8,59,77,83,69), Habitat.E = c(23,15,99,0,10))
I have multiple data frames with the above structure which I have assigned to a list object:
all.spp.data <- list(spp.1.data, spp.2.data, spp.1.data...n)
I am then trying to coerce all data frames to as.numeric() so I can create data frames of % habitat use i.e:
# data, which is now numeric as per Phil's code ;)
data.numeric <- lapply(data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
> head(data.numeric[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E
1 100 0 80 8 23
2 45 0 22 59 15
3 0 203 8 77 99
4 9 45 9 83 0
5 0 89 20 69 10
EDIT: I would like to sum every row, in all data frames
# Add a Sums column to each data frame, populated by rowSums()
f <- function(i){
data.numeric[[i]]$Sums <- rowSums(data.numeric[[i]])
data.numeric[[i]]
}
data.numeric.SUM <- lapply(seq_along(data.numeric), f)
head(data.numeric.SUM[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1 100 0 80 8 23 211
2 45 0 22 59 15 141
3 0 203 8 77 99 387
4 9 45 9 83 0 146
5 0 89 20 69 10 188
EDIT: This is the code I used to convert values within the data frames to % habitat used
# Used Phil's logic to convert all numbers in percentages
data.numeric.SUM.perc <- lapply(data.numeric.SUM,
function(x) {
x[] <- (x[]/x[,6])*100
x
})
Perc.Habitat.A Perc.Habitat.B Perc.Habitat.C Perc.Habitat.D Perc.Habitat.E
1 47 32 0 6 0
2 0 0 52 31 47
3 38 16 2 6 11
4 4 42 20 57 37
5 11 11 26 0 5
6 100 100 100 100 100
This is still not the most condensed way to do this, but it did the trick for me.
Thank you, Phil, Val and Leo P, for helping with this problem.
I'd do this a bit more explicitly:
all.spp.data <- lapply(all.spp.data, function(x) {
x[] <- lapply(x, as.numeric)
x
})
As a personal preference, this clearly conveys to me that I'm looping over each column in a data frame, and looping over each data frame in a list.
If you really want to do it all with lapply, here's a way to go:
lapply(all.spp.data, function(x) do.call(cbind, lapply(seq_len(ncol(x)), function(y) as.numeric(x[, y]))))
This uses a nested lapply call. The outer one passes each data frame in turn as x. The inner one passes each column index of x as y, so each column can be referenced as x[, y].
Since everything gets split up into single vectors, I'm calling do.call(cbind, ... ) to bring it back together as a matrix (note that this drops the column names). If you prefer, you could wrap it in data.frame() to bring it back to the original type.
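The conversion and the percentage step from the question's edits can also be combined into one pass. The following is a sketch, not the method used in the thread: it assumes every column of every data frame is a habitat area, reuses `spp.1.data` twice as a stand-in for the real list, and `all.spp.perc` is a hypothetical name.

```r
# Example data frame from the question, reused twice to stand in for the real list
spp.1.data <- data.frame(Habitat.A = c(100, 45, 0, 9, 0),
                         Habitat.B = c(0, 0, 203, 45, 89),
                         Habitat.C = c(80, 22, 8, 9, 20),
                         Habitat.D = c(8, 59, 77, 83, 69),
                         Habitat.E = c(23, 15, 99, 0, 10))
all.spp.data <- list(spp.1.data, spp.1.data)

all.spp.perc <- lapply(all.spp.data, function(x) {
  # Coerce every column to numeric while keeping the data frame structure
  x[] <- lapply(x, as.numeric)
  # Divide each row by its own total; a vector of length nrow(x) is
  # recycled down every column, so row i is divided by rowSums(x)[i]
  100 * x / rowSums(x)
})

print(round(all.spp.perc[[1]]))
```

Dividing by `rowSums(x)` directly avoids adding a temporary Sums column and then dividing that column by itself.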

How to multiply 2 columns using [nested for loop] in a dataframe in r [duplicate]

This question already has answers here:
calculate row sum and product in data.frame
(4 answers)
Closed 5 years ago.
This is really driving me nuts. Say I have 3 datasets that I am combining into a dataframe.
a = c(1,2,3,4)
b = c(2,4,6,8)
c = c(3,6,9,12)
d = as.data.frame(cbind(a,b,c)) # combined dataframe
I could just write:
multiply = d[,1]*d[,2]*d[,3]
#[1] 6 48 162 384
But this is not feasible if I have many columns, so I need a nested for-loop statement. This is what I attempted:
for (col in 1:ncol(d)){
for (j in 1:ncol(d)){
multiply=0
multiply = d[,col]*d[,j]
}
}
print(multiply)
#[1] 9 36 81 144
It just took column 2 and multiplied it by column 3. WHY? Any improvement to my nested for-loop will be highly appreciated; this is what I am interested in learning more about.
Please do not suggest a solution which involves using apply functions. I am already aware of them.
We can use Reduce
Reduce(`*`, d)
#[1] 6 48 162 384
Or with rowProds from library(matrixStats)
library(matrixStats)
rowProds(as.matrix(d))
#[1] 6 48 162 384
If we need a for loop
v1 <- rep(1, nrow(d))
for(j in seq_along(d)){
v1 <- v1*d[,j]
}
v1
#[1] 6 48 162 384
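To see why the nested loop in the question prints 9 36 81 144, note that `multiply` is overwritten on every pass, so only the final pass (col = 3, j = 3) survives; the printed values are column c multiplied by itself. The sketch below reproduces that and contrasts it with a single loop that carries a running product forward; `running` is an illustrative name.

```r
d <- data.frame(a = c(1, 2, 3, 4), b = c(2, 4, 6, 8), c = c(3, 6, 9, 12))

# The question's loop: each pass replaces the previous result
multiply <- NULL
for (col in 1:ncol(d)) {
  for (j in 1:ncol(d)) {
    multiply <- d[, col] * d[, j]  # nothing is carried over between passes
  }
}
# Only the last pass (col = 3, j = 3) is left
print(identical(multiply, d[, 3] * d[, 3]))

# A single loop is enough when the result of each step feeds the next one
running <- rep(1, nrow(d))
for (j in seq_along(d)) {
  running <- running * d[, j]
}
print(running)
```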

R counting the occurrences of similar rows of data frame

I have data in the following format called DF (this is just a made up simplified sample):
eval.num  eval.count  fitness  fitness.mean  green.h.0  green.v.0  offset.0  random
1         1           1500     1500          100        120        40        232342
2         2           1000     1250          100        120        40        11843
3         3           1250     1250          100        120        40        981340234
4         4           1000     1187.5        100        120        40        4363453
5         1           2000     2000          200        100        40        345902
6         1           3000     3000          150        90         10        943
7         1           2000     2000          90         90         100       9304358
8         2           1800     1900          90         90         100       284333
However, the eval.count column is incorrect and I need to fix it. It should report the number of rows with the same values for (green.h.0, green.v.0, and offset.0) by only looking at the previous rows.
The example above uses the expected values, but assume they are incorrect.
How can I add a new column (say "count") which will count all previous rows which have the same values of the specified variables?
I have gotten help on a similar problem of just selecting all rows with the same values for specified columns, so I supposed I could just write a loop around that, but it seems inefficient to me.
Ok, let's first do it in the easy case where you just have one column.
> data <- rep(sample(1000, 5),
sample(5, 5))
> head(data)
[1] 435 435 435 278 278 278
Then you can just use rle to figure out the contiguous sequences:
> sequence(rle(data)$lengths)
[1] 1 2 3 1 2 3 4 5 1 2 3 4 1 2 1
Or altogether:
> head(cbind(data, sequence(rle(data)$lengths)))
[1,] 435 1
[2,] 435 2
[3,] 435 3
[4,] 278 1
[5,] 278 2
[6,] 278 3
For your case with multiple columns, there are probably a bunch of ways of applying this solution. Easiest might be to just paste the columns you care about together to form a single vector.
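One way to sketch the paste idea is shown below. It swaps `rle()` for `ave()` so that matching rows are counted even when they are not contiguous; that substitution is an assumption about what is wanted, not part of the answer above. Only the three key columns from the question's sample are reproduced, and `key` is an illustrative name.

```r
# Key columns taken from the sample data in the question
DF <- data.frame(green.h.0 = c(100, 100, 100, 100, 200, 150, 90, 90),
                 green.v.0 = c(120, 120, 120, 120, 100, 90, 90, 90),
                 offset.0  = c(40, 40, 40, 40, 40, 10, 100, 100))

# Paste the columns of interest into a single key per row
key <- paste(DF$green.h.0, DF$green.v.0, DF$offset.0, sep = "_")

# Within each group of identical keys, number the rows 1, 2, 3, ...
# so each row gets the count of matching rows up to and including itself
DF$count <- ave(seq_along(key), key, FUN = seq_along)
print(DF)
```

Subtracting 1 from `DF$count` gives the number of strictly earlier matching rows, if that is the definition needed.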
Okay I used the answer I had on another question and worked out a loop that I think will work. This is what I'm going to use:
cmpfun2 <- function(r) {
count <- 0
if (r[1] > 1)
{
for (row in 1:(r[1]-1))
{
if(all(r[27:51] == DF[row,27:51,drop=FALSE])) # compare to row bind
{
count <- count + 1
}
}
}
return (count)
}
brows <- apply(DF[], 1, cmpfun2)
print(brows)
Please comment if I made a mistake and this won't work, but I think I've figured it out. Thanks!
I have a solution I figured out over time (sorry I haven't checked this in a while)
checkIt <- function(bind) {
print(bind)
cmpfun <- function(r) {all(r == heeds.data[bind,23:47,drop=FALSE])}
brows <- apply(heeds.data[,23:47], 1, cmpfun)
#print(heeds.data[brows,c("eval.num","fitness","green.h.1","green.h.2","green.v.5")])
print(nrow(heeds.data[brows,c("eval.num","fitness","green.h.1","green.h.2","green.v.5")]))
}
Note that heeds.data is my actual data frame; I originally printed a few columns just to make sure that it was working (that line is now commented out). Also, 23:47 is the range of columns that needs to be checked for duplicates.
Also, I really haven't learned as much R as I should so I'm open to suggestions.
Hope this helps!
