How to multiply 2 columns using [nested for loop] in a dataframe in R [duplicate]

This question already has answers here:
calculate row sum and product in data.frame
(4 answers)
Closed 5 years ago.
This is really driving me nuts. Say I have 3 vectors that I am combining into a dataframe.
a = c(1,2,3,4)
b = c(2,4,6,8)
c = c(3,6,9,12)
d = as.data.frame(cbind(a,b,c)) # combined dataframe
I could just write:
multiply = d[,1]*d[,2]*d[,3]
#[1] 6 48 162 384
But this is not feasible when I have many columns, so I need a nested for-loop statement. This is what I attempted:
for (col in 1:ncol(d)){
  for (j in 1:ncol(d)){
    multiply = 0
    multiply = d[,col]*d[,j]
  }
}
print(multiply)
#[1] 9 36 81 144
It just took column 3 and multiplied it by column 3. WHY???? Any improvement of my nested for-loop will be highly appreciated; this is what I am interested to know more about.
Please do not suggest a solution which involves using apply functions. I am already aware of them.

The nested loop overwrites multiply on every iteration, so when it finishes you are left with only the product from the very last pass, d[,3]*d[,3]. To build a running product across all columns, we can use Reduce:
Reduce(`*`, d)
#[1] 6 48 162 384
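For what it's worth, here is what the fold does, written out (my illustration, not part of the original answer):
# Reduce folds `*` left to right: ((d[[1]] * d[[2]]) * d[[3]])
Reduce(`*`, d, accumulate = TRUE) # accumulate = TRUE also returns the intermediate products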
Or with rowProds from library(matrixStats)
library(matrixStats)
rowProds(as.matrix(d))
#[1] 6 48 162 384
If we need a for loop, the key is to initialise the accumulator once, outside the loop:
v1 <- rep(1, nrow(d))
for(j in seq_along(d)){
  v1 <- v1*d[,j]
}
v1
#[1] 6 48 162 384

Related

R: replace columns of matrix if vector fulfills certain condition [duplicate]

I have a dataframe as follows
Me You They Him She
 1   4    6   3 233
82   0    2   4 122
98   2    5   2  99
I want to get a new dataframe which contains only those columns whose colMeans are > 30, so the result should look like
Me She
 1 233
82 122
98  99
I tried something like
dfNew<-subset(df,colMeans(df[, 1:ncol(df)]>30))
but got the error
Error in subset.data.frame(df[, 1:ncol(df)]> :
'subset' must be logical
Clearly I don't know what I'm doing.
try this:
df[,colMeans(df)>30]
I think this is what you are looking for.
This step is just me creating your data.
Me <- c(1,82,98)
You <- c(4,0,2)
They <- c(6,2,5)
Him <- c(3,4,2)
She <- c(233,122,99)
df <- as.data.frame(cbind(Me, You, They, Him, She))
This is what you want.
df[, sapply(df, mean) > 30]
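One caveat worth adding (my note, not from the original answers): if only one column clears the threshold, single-bracket indexing drops the result to a vector, so keep drop = FALSE in there if that can happen:
df[, colMeans(df) > 30, drop = FALSE] # stays a data.frame even when only one column passes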

Find unique rows [duplicate]

This question already has an answer here:
Extracting only unique rows from data frame in R [duplicate]
(1 answer)
Closed 5 years ago.
This seems so simple, but I can't figure it out.
Given this data frame
df=data.frame(
  x = c(12,12,165,165,115,148,148,155,155,521),
  y = c(54,54,122,122,215,108,108,655,655,151)
)
df
x y
1 12 54
2 12 54
3 165 122
4 165 122
5 115 215
6 148 108
7 148 108
8 155 655
9 155 655
10 521 151
Now, how can I get the rows that exist only once, i.e. rows 5 and 10? The order of rows can be totally arbitrary, so checking the "next" row is not an option. I tried many things, but nothing worked on my data.frame, which has ~40k rows.
I had one solution working on a subset (~1k rows) of my data.frame which took 3 minutes to process. Thus, my solution would require about 120 minutes on my original data.frame, which is not appropriate. Can somebody help?
Check duplicated from the beginning and from the end of the data frame; if neither returns TRUE for a row, select it:
df[!(duplicated(df) | duplicated(df, fromLast = TRUE)),]
# x y
#5 115 215
#10 521 151
A solution with table
library(dplyr)
table(df) %>% as.data.frame %>% subset(Freq == 1) %>% select(-3)
or with base R, since you said in the comments you prefer not to load packages:
subset(as.data.frame(table(df)), Freq == 1)[,-3]
Also, data.table is very fast for big data sets and filtering, so it may be worth trying too since you mentioned speed:
library(data.table)
df2 <- copy(df)
df2 <- setDT(df2)[, COUNT := .N, by='x,y'][COUNT == 1][, c("x","y")]
A solution using dplyr. df2 is the final output.
library(dplyr)
df2 <- df %>%
  count(x, y) %>%
  filter(n == 1) %>%
  select(-n)
Another base R solution that uses ave to calculate the total number of occurrences of each row and subsets only those that occur exactly once. It could also be modified to subset rows that occur a specific number of times.
df[ave(1:NROW(df), df, FUN = length) == 1,]
# x y
#5 115 215
#10 521 151
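As mentioned, the same idea extends to any target count; for example, a sketch with 2 as a hypothetical count:
df[ave(1:NROW(df), df, FUN = length) == 2,] # rows that occur exactly twice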

R lapply(): Change all columns within all data frames in a list to numeric, then convert all values to percentages

Question:
I am a little stumped as to how I can batch process as.numeric() (or any other function for that matter) for columns in a list of data frames.
I understand that I can view specific data frames or columns within this list by using:
> my.list[[1]]
# or columns within this data frame using:
> my.list[[1]][1]
But my trouble comes when I try to fold this into an lapply() call to change all of the data from integer to numeric.
# Example of what I am trying to do
> my.list[[each data frame in list]][each column in data frame] <-
    as.numeric(my.list[[each data frame in list]][each column in data frame])
If you can assist me in any way, or know of any resources that can help me out I would appreciate it.
Background:
My data frames are structured as in the example below, where I have 5 habitat types and information on how much area an individual species' home range covers in each (one data frame per species, 1 to n):
# Example data
spp.1.data <- data.frame(Habitat.A = c(100,45,0,9,0), Habitat.B = c(0,0,203,45,89),
                         Habitat.C = c(80,22,8,9,20), Habitat.D = c(8,59,77,83,69),
                         Habitat.E = c(23,15,99,0,10))
I have multiple data frames with the above structure which I have assigned to a list object:
all.spp.data <- list(spp.1.data, spp.2.data, ..., spp.n.data)
I am then trying to coerce all data frames with as.numeric() so I can create data frames of % habitat use, i.e.:
# data, which is now numeric as per Phil's code ;)
data.numeric <- lapply(data, function(x) {
  x[] <- lapply(x, as.numeric)
  x
})
> head(data.numeric[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E
1 100 0 80 8 23
2 45 0 22 59 15
3 0 203 8 77 99
4 9 45 9 83 0
5 0 89 20 69 10
EDIT: I would like to sum every row in all data frames
# Add row at the end of each data frame populated by rowSums()
f <- function(i){
  data.numeric[[i]]$Sums <- rowSums(data.numeric[[i]])
  data.numeric[[i]]
}
data.numeric.SUM <- lapply(seq_along(data.numeric), f)
head(data.numeric.SUM[[1]])
Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1 100 0 80 8 23 211
2 45 0 22 59 15 141
3 0 203 8 77 99 387
4 9 45 9 83 0 146
5 0 89 20 69 10 188
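Since f never actually needs the index, an equivalent sketch (my rewrite, same output) maps over the data frames directly:
# cbind() with a named argument appends the Sums column to each data frame
data.numeric.SUM <- lapply(data.numeric, function(x) cbind(x, Sums = rowSums(x)))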
EDIT: This is the code I used to convert the values within the data frames to % habitat use
# Used Phil's logic to convert all the numbers to percentages
data.numeric.SUM.perc <- lapply(data.numeric.SUM, function(x) {
  x[] <- (x[]/x[,6])*100
  x
})
  Habitat.A Habitat.B Habitat.C Habitat.D Habitat.E Sums
1        47         0        38         4        11  100
2        32         0        16        42        11  100
3         0        52         2        20        26  100
4         6        31         6        57         0  100
5         0        47        11        37         5  100
This is still not the most condensed way to do this, but it did the trick for me.
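For anyone after a more condensed variant, a sketch that relies on data-frame/vector recycling (assuming Sums is the last column, as above):
# each column is divided element-wise by the Sums vector, then scaled to percent
data.numeric.SUM.perc <- lapply(data.numeric.SUM, function(x) x / x$Sums * 100)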
Thank you, Phil, Val and Leo P, for helping with this problem.
I'd do this a bit more explicitly:
all.spp.data <- lapply(all.spp.data, function(x) {
  x[] <- lapply(x, as.numeric)
  x
})
As a personal preference, this clearly conveys to me that I'm looping over each column in a data frame, and looping over each data frame in a list.
If you really want to do it all with lapply, here's a way to go:
lapply(all.spp.data, function(x) do.call(cbind, lapply(1:ncol(x), function(y) as.numeric(x[,y]))))
This uses a nested lapply call. The outer one references each single data.frame as x. The inner one references each column index of x as y. So in the end I can reference each column by x[,y].
Since everything will be split up into single vectors, I'm calling do.call(cbind, ...) to bring it back to a matrix. If you prefer, you could wrap data.frame() around it to get back to the original type.
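One caveat I'd add: do.call(cbind, ...) over bare vectors loses the column names, so a sketch that restores them:
all.spp.numeric <- lapply(all.spp.data, function(x) {
  m <- do.call(cbind, lapply(seq_len(ncol(x)), function(y) as.numeric(x[, y])))
  colnames(m) <- names(x) # cbind() on unnamed vectors drops the names
  as.data.frame(m)
})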

Avoid using a loop to get the sum of rows in R, where I want to start and stop the sum at different columns for each row

I am relatively new to R, coming from Stata. I have a data frame that has 100+ columns and thousands of rows. Each row has a start value, a stop value, and 100+ columns of numerical values. The goal is to get, for each row, the sum from the column that corresponds to the start value through the column that corresponds to the stop value. This is direct enough to do in a loop that looks like this (the data frame is df, start is the start column, stop is the stop column):
for(i in 1:nrow(df)) {
  df$out[i] <- rowSums(df[i, df$start[i]:df$stop[i]])
}
This works great, but it is taking 15 minutes or so. Does anyone have any suggestions on a faster way to do this?
You can do this using some algebra (if you have a sufficient amount of memory):
DF <- data.frame(start=3:7, end=4:8)
DF <- cbind(DF, matrix(1:50, nrow=5, ncol=10))
# start end 1 2 3 4 5 6 7 8 9 10
#1 3 4 1 6 11 16 21 26 31 36 41 46
#2 4 5 2 7 12 17 22 27 32 37 42 47
#3 5 6 3 8 13 18 23 28 33 38 43 48
#4 6 7 4 9 14 19 24 29 34 39 44 49
#5 7 8 5 10 15 20 25 30 35 40 45 50
take <- outer(seq_len(ncol(DF)-2)+2, DF$start-1, ">") &
        outer(seq_len(ncol(DF)-2)+2, DF$end+1, "<")
diag(as.matrix(DF[,-(1:2)]) %*% take)
#[1] 7 19 31 43 55
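A note on memory (mine, not the answerer's): diag(M %*% take) builds a full n-by-n product just to keep its diagonal; element-wise multiplication gives the same sums more cheaply:
# same result as diag(as.matrix(DF[,-(1:2)]) %*% take), without the full product
rowSums(as.matrix(DF[,-(1:2)]) * t(take))
#[1] 7 19 31 43 55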
If you are dealing with values that are all of the same type, you typically want to work with matrices. Here is a solution in matrix form:
rows <- 10^3
cols <- 10^2
start <- sample(1:cols, rows, replace=T)
end <- pmin(cols, start + sample(1:(cols/2), rows, replace=T))
# first 2 cols of matrix are start and end, the rest are
# random data
mx <- matrix(c(start, end, runif(rows * cols)), nrow=rows)
# use `apply` to apply a function to each row, here the
# function sums each row excluding the first two values
# from the value in the start column to the value in the
# end column
apply(mx, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
# df version
df <- as.data.frame(mx)
df$out <- apply(df, 1, function(x) sum(x[-(1:2)][x[[1]]:x[[2]]]))
You can convert your data.frame to a matrix with as.matrix. You can also run the apply directly on your data.frame as shown, which should still be reasonably fast. The real problem with your code is that you are modifying a data frame nrow times, and modifying data frames is very slow. By using apply you get around that by generating your answer (the $out column) in one go, which you can then cbind back to your data frame (and that means you modify your data frame just once).

How to find the difference between 2 dataframes?

I have 2 dataframes which are "exactly" the same. The difference between them is that one has 676 observations (rows) and the other has 666. I don't know which rows are missing from the second dataframe.
It would be easiest for me if someone could show me the code to make a third dataframe containing the 10 rows that are missing.
The names of the dataframes:
- dataset1 (676 rows)
- dataset2 (666 rows)
Thx.
Stack dataset2 on top of dataset1 with rbind; any dataset1 row that also appears in dataset2 is then flagged as a duplicate, and tail(..., nrow(dataset1)) keeps just the flags belonging to dataset1's rows:
dataset1[tail(!duplicated(rbind(dataset2, dataset1)), nrow(dataset1)), ]
Here's an approach:
library(qdap)
## generate random problem
prob <- sample(1:nrow(mtcars), 1)
## remove the random problem row
mtcars2 <- mtcars[-prob, ]
## Throw it into a list of 2 dataframes so they're easier to work with
dat <- list(mtcars, mtcars2)
## Use qdap's `paste2` function to paste all columns together
dat2 <- lapply(dat, paste2)
## Find the shorter data set
wmn <- which.min(sapply(dat2, length))
## Add additional element to shorter one
dat2[[wmn]] <- c(dat2[[wmn]], NA)
## check each element of the 2 pasted data sets for equality
out <- mapply(identical, dat2[[1]], dat2[[2]])
## Which row's the problem
which(!out)[1]
which(!out)[1] == prob
If which(!out)[1] is NA, the problem is in the last row.
The first FALSE you see is where the problem is located.
EDIT: removed the for loop
I would say try to use merge and then look for where the merge result has NA values.
Here's an example using dummy data:
set.seed(1)
df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- df1[-sample(1:100,10),]
dim(df1)
# [1] 100 2
dim(df2)
# [1] 90 2
out <- merge(df1,df2,by='x',all.x=TRUE)
in1not2 <- which(is.na(out$y.y))
in1not2
# [1] 6 25 33 51 52 53 57 73 77 82
Then you can extract:
> df1[in1not2,]
x y
6 -0.8204684 1.76728727
25 0.6198257 -0.10019074
33 0.3876716 0.53149619
51 0.3981059 0.45018710
52 -0.6120264 -0.01855983
53 0.3411197 -0.31806837
57 -0.3672215 1.00002880
73 0.6107264 0.45699881
77 -0.4432919 0.78763961
82 -0.1351786 0.98389557
