Aggregate a dataframe based on three columns in R - r

I have a dataframe with a structure like this :
V1 V2 V3 V4
1 1.35 A 10241297 10459084
2 16.00 A 10241297 10459084
3 1.47 A 10241297 10459084
I would like to average V1 based on V2, V3 and V4
All the aggregate examples I have seen aggregate on a single column.
Any help is appreciated
Thanks

This is one of the ways you could accomplish what I think you are looking for (hard to tell because you have one set of unique identifiers in the example data):
aggregate( V1 ~ V2 + V3 + V4 , df , mean )
# V2 V3 V4 V1
# 1 A 10241297 10459084 6.273333
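Because the example data contains only one combination of V2, V3 and V4, the grouping is easier to see with a second group added. The sketch below uses made-up rows for a hypothetical group "B" (not from the question) and only base R:

```r
# Hypothetical data: the three rows from the question plus an invented group "B"
df <- data.frame(
  V1 = c(1.35, 16.00, 1.47, 2.00, 4.00),
  V2 = c("A", "A", "A", "B", "B"),
  V3 = rep(10241297, 5),
  V4 = rep(10459084, 5)
)

# One output row per unique (V2, V3, V4) combination, holding the mean of V1
res <- aggregate(V1 ~ V2 + V3 + V4, data = df, FUN = mean)
res
```

Each distinct combination of the three columns on the right-hand side of the formula yields one row in the result.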

Here is a plyr approach.
library(plyr)
ddply(df,.(V2,V3,V4),summarise,V1=mean(V1))
V2 V3 V4 V1
1 A 10241297 10459084 6.273333


Calculating weight ratios in the presence of empty cells

I have a sample which needs to be weighted in order to represent the population.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
V1 V2 V3 V4
1: 1 0 2 2
2: 3 4 3 0
population <- fread("
10,20,20,10
30,40,20,10
")
This weight would simply be:
weights <- population/sample
V1 V2 V3 V4
1: 10 Inf 10.000000 5
2: 10 10 6.666667 Inf
However, because V2 in row 1 of the sample has no observations, it receives an infinite weight (Note that also V4 in row 2 receives an Inf, but this is easier to solve, because the weight is irrelevant, as there are no observations in either the sample or the population).
A solution to the problem would be to count V1 and V2 together in the sample and the population.
EDIT:
After some thought I realised that, for the weights to be correct, only the population values have to be adapted. If V1 and V2 in row 1 of population are added together in V1 of population, this already leads to the correct weight for the sample observation in V1, row 1. The value of V2 becomes irrelevant because there is no observation in the sample to receive that weight.
End of EDIT
The observation would then get a weight of:
(population[1,1]+population[1,2])/(sample[1,1]+sample[1,2])
(10+20)/(1+0)=30
In my actual data, however, there are many more rows, with a 0 here and there in the sample. I am trying to figure out whether there is a way to write my code so that I do not have to do this manually.
Desired outcome (notice that the weight of V1 row 1 is now 30):
weights
V1 V2 V3 V4
1: 30 0 10.000000 5
2: 10 10 6.666667 0
Attempt
I was thinking of doing something like:
for (i in seq_along(ncol(sample))) {
lapply(population, (ifelse(sample[i]==0), population[i]<-population[i+1], population[i])
}
The idea is that the population value of the cell to the right is added when the value in the sample is zero. However, I am having trouble getting the syntax right, and even if it worked, it would not solve the case where V4 is 0.
Here is a rather verbose solution. If there were more columns that should be aggregated when the sample contains zeros, I would have proposed a more flexible approach, but this seems sufficient for your example.
library(data.table)
sample <- fread("
1,0,2,2
3,4,3,0
")
population <- fread("
10,20,20,10
30,40,20,10
")
# aggregate Values if sample is zero
population[sample$V1 == 0, `:=`(V1 = 0,
V2 = V1 + V2)]
population[sample$V2 == 0, `:=`(V1 = V1 + V2,
V2 = 0)]
weights <- population/sample
# Fix NaNs
weights[is.na(weights), ] <- 0
weights
#> V1 V2 V3 V4
#> 1: 30 0 10.000000 5
#> 2: 10 10 6.666667 Inf
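Note that the output above still shows Inf for V4 in row 2, whereas the desired outcome has 0 there: 10/0 is Inf, and is.na() is FALSE for Inf. A minimal sketch of one way to zero out both NaN and Inf, using plain matrices as a simplifying assumption (sample_m and population_m are hypothetical names, and population_m already has V1 and V2 of row 1 merged):

```r
# Hypothetical matrix versions of the tables in the question
sample_m     <- matrix(c(1, 0, 2, 2,
                         3, 4, 3, 0), nrow = 2, byrow = TRUE)
population_m <- matrix(c(30,  0, 20, 10,
                         30, 40, 20, 10), nrow = 2, byrow = TRUE)

weights_m <- population_m / sample_m

# is.finite() is FALSE for NaN (0/0) and for Inf (x/0), so both become 0
weights_m[!is.finite(weights_m)] <- 0
weights_m
```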

convert a list of class data frame objects into objects of class matrix - list

I have a list of data frames. Each data frame has 6 rows and 6 columns. They are all numbers, however, all data frames have their elements as class character.
Example:
$`A`
V1 V2 V3 V4 V5 V6
V1 0.1212 0.6231 0.4431 0.3213 0.6578 0.1259
V2 2.1234 0.6532 0.9845 0.8743 0.8732
V3 0.2314 0.7648 0.7634 0.8732
V4 0.1234 0.6544 0.3456
V5 0.7653 0.9812
V6 0.1265
$`B`
V1 V2 V3 V4 V5 V6
V1 0.2345 0.1234 0.5647 0.7891 0.6721 0.3259
V2 1.1334 0.4332 0.1245 0.2343 0.5332
V3 0.2914 0.1648 0.2334 0.1232
V4 0.1234 0.6744 0.5656
V5 0.3553 0.9812
V6 0.4665
I would like to change all data frames of the list to class matrix (numerical).
I tried:
lapply (list, data.matrix)
but the result is a list of data frames with integers. Example:
V1 V2 V3 V4 V5 V6
V1 2 2 2 2 2 4
V2 1 3 4 5 5 7
V3 1 1 3 4 6 3
V4 1 1 1 3 4 5
V5 1 1 1 1 1 1
V6 1 1 1 1 1 1
Also tried to run
lapply(list, as.matrix)
however, I got a list of quoted matrices, like this:
$`A`
V1 V2 V3 V4 V5 V6
V1 "0.1212" "0.6231" "0.4431" "0.3213" "0.6578" "0.1259"
V2 "2.1234" "0.6532" "0.9845" "0.8743" "0.8732"
V3 "0.2314" "0.7648" "0.7634" "0.8732"
V4 "0.1234" "0.6544" "0.3456"
V5 "0.7653" "0.9812"
V6 "0.1265"
How can I convert these data frames of my list from character class to matrix class?
We can loop over the list with lapply, then loop over the columns of each data.frame with a second lapply, convert each column to numeric, assign the result back to the original data.frame object, and return the data.frame ('x'):
list <- lapply(list, function(x) {x[] <- lapply(x, as.numeric);x})
If those are factor columns, convert to character first and then to numeric:
lapply(list, function(x) {x[] <- lapply(x, function(y) as.numeric(as.character(y)))
x})
You can convert to numeric and then restore the matrix dimensions:
lapply(dfs, function(x) matrix(as.numeric(x), ncol = n_cols))
Data
set.seed(1L)
n_cols <- 6
n_total <- 36
a <- matrix(rnorm(n_total), ncol = n_cols)
b <- matrix(rnorm(n_total), ncol = n_cols)
a[lower.tri(a)] <- ""
b[lower.tri(b)] <- ""
dfs <- list(a, b)
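Since the question asks for numeric matrices built from character data frames, another option is as.matrix followed by a change of storage mode. This is a sketch and does not come from the answers above; lst and its values are a hypothetical, smaller stand-in for the list in the question, and empty cells become NA:

```r
# Hypothetical list of all-character data frames (2 x 2 instead of 6 x 6)
lst <- list(
  A = data.frame(V1 = c("0.1212", "2.1234"), V2 = c("0.6231", ""),
                 stringsAsFactors = FALSE),
  B = data.frame(V1 = c("0.2345", "1.1334"), V2 = c("0.1234", ""),
                 stringsAsFactors = FALSE)
)

mats <- lapply(lst, function(x) {
  m <- as.matrix(x)            # character matrix; dim and column names kept
  storage.mode(m) <- "double"  # convert in place; "" becomes NA
  m
})
```

Changing the storage mode keeps the dimensions and dimnames, so there is no need to rebuild the matrix with ncol.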

Replace specific values in a data frame except first column

I have this line in one of my functions: result[result>0.05] <- "". It replaces all values in my data frame greater than 0.05, including the row names in the first column. How can I avoid this?
This is a fast way too:
df <- as.data.frame(matrix(runif(100),nrow=10))
df[-1][df[-1]>0.05] <- ''
Output:
> df
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
1 0.60105471
2 0.63340567
3 0.11625581
4 0.96227379 0.0173133104108274
5 0.07333583
6 0.05474430 0.0228175506927073
7 0.62610309
8 0.76867090
9 0.76684615 0.0459537433926016
10 0.83312158
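Assigning '' turns the affected columns into character columns, which is why the surviving values print with so many digits. If the columns should stay numeric, a variant of the same indexing with NA may be preferable. This is a sketch under that assumption; the seed and the original_first helper are only there for illustration:

```r
set.seed(1)  # arbitrary seed so the example is reproducible
df <- as.data.frame(matrix(runif(100), nrow = 10))
original_first <- df$V1  # kept only to show that column 1 is untouched

# df[-1] drops the first column on both sides, so only columns 2..10 change
df[-1][df[-1] > 0.05] <- NA
```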

R extracting data conditionally?

I have a simulation dataset that explores a parameter space, and each set of parameters is run multiple times (iterations). It looks like this:
p1 p2 p3 iteration result
=================================
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36
...
As can be seen from this example, both (v3, v2, v1) and (v2, v1, v3) are run twice. I am trying to extract only the rows with the max result for each parameter setting. In this example, only rows 3 and 4 should be kept, as they represent the best results for each parameter set. Is there an easy way to accomplish that in R? Thanks
df <- read.table(textConnection("p1 p2 p3 iteration result
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36"), header = TRUE)
library(plyr)
ddply(df, .(p1,p2,p3), function(x) return(x[(which(x$result == max(x$result))), ]))
p1 p2 p3 iteration result
1 v2 v1 v3 1 29.36
2 v3 v2 v1 2 28.80
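The same selection can also be sketched in base R, without plyr, by comparing each result to its group maximum. The names grp and best are hypothetical; the data is the example from the question:

```r
df <- read.table(text = "p1 p2 p3 iteration result
v3 v2 v1 1 23.8
v2 v1 v3 2 20.36
v3 v2 v1 2 28.8
v2 v1 v3 1 29.36", header = TRUE)

# One factor level per parameter combination that actually occurs
grp <- interaction(df$p1, df$p2, df$p3, drop = TRUE)

# ave() returns the group maximum aligned with each row; keep rows equal to it
best <- df[df$result == ave(df$result, grp, FUN = max), ]
best
```

Unlike the ddply call, this keeps the rows in their original order and keeps every row that ties for the maximum.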

How to swap values between two columns

I have a data frame with three variables and 250K records. As an example consider
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
V1 V2 V3
1 a 2
2 a 3
4 b 1
and want to swap values between V1 and V3 based on the value of V2 as follows:
if V2 == 'b' then V1 <- V3 and V3 <- V1
resulting in
V1 V2 V3
1 a 2
2 a 3
1 b 4
I tried a do loop but it takes forever. If I use Perl, it takes seconds. I believe this task can be done efficiently in R as well. Any suggestions are appreciated.
Try this
df <- data.frame(V1=c(1,2,4), V2=c("a","a","b"), V3=c(2,3,1))
df[df$V2 == "b", c("V1", "V3")] <- df[df$V2 == "b", c("V3", "V1")]
which yields:
> df
V1 V2 V3
1 1 a 2
2 2 a 3
3 1 b 4
You can use transform to do this.
df <- transform(df, V3 = ifelse(V2 == 'b', V1, V3), V1 = ifelse(V2 == 'b', V3, V1))
Edited: I got tripped up with column names, sorry. This works.
If you don't mind the rows ending up in different orders, this is kind of a 'cute' way to do this:
dat <- read.table(textConnection("V1 V2 V3
1 a 2
2 a 3
4 b 1"),sep = "",header = TRUE)
tmp <- dat[dat$V2 == 'b',3:1]
colnames(tmp) <- colnames(dat)
rbind(dat[dat$V2 != 'b',],tmp)
Basically, that's just grabbing the rows where V2 == 'b', reversing the columns and slapping them back together with everything else. This can be extended if you have more columns that don't need switching; you'd just use an integer index with those values transposed, rather than just 3:1.
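For comparison, the swap can also be spelled out with a logical index and a temporary vector, which keeps the row order and avoids any loop. This is a minimal sketch on the example data; swap and tmp are hypothetical names:

```r
df <- data.frame(V1 = c(1, 2, 4), V2 = c("a", "a", "b"), V3 = c(2, 3, 1))

swap <- df$V2 == "b"        # rows whose V1 and V3 should trade places
tmp  <- df$V1[swap]         # remember the old V1 values
df$V1[swap] <- df$V3[swap]  # V1 takes V3
df$V3[swap] <- tmp          # V3 takes the old V1
df
```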
