R: Create Integer Array from Two Columns, then Plot Histogram

I have a dataframe with 2 numeric columns. For each row, I want to create an array of the integers that fall between the values in the two columns, inclusive of the values themselves. Then, I want to compile all of the values into a single column to generate a histogram.
Input:
df
C1 C2
A 3 -92
B 8 -162
C 20 -97
D 50 -76
Output:
sdf5$Values
-92
-91
-90
...
2
3
-162
-161
...
7
8
...
My actual dataframe has 62 rows. My current code gives me frequencies > 100 (should have a maximum of 62 for any integer). The code worked on a dummy dataframe, so I'm not sure where things are going wrong.
list <- mapply(":", df$C2, df$C1)
df3 <- do.call(rbind.data.frame, list)
sdf3 <- stack(df3)
sdf4 <- as.data.frame(sdf3$values)
sdf5 <- rename(sdf4, Values = 1)
a <- ggplot(sdf5, aes(x = Values)) +
  geom_histogram(binwidth = 1, center = 0)

The culprit is most likely do.call(rbind.data.frame, list): the sequences have different lengths, so rbind.data.frame recycles values to fill a rectangular data frame, which inflates the counts. As an alternative:
library(ggplot2)
df <- read.table(text = "  C1   C2
A  3  -92
B  8 -162
C 20  -97
D 50  -76")
list <- mapply(":", df$C2, df$C1)
df2 <- data.frame(Values = do.call(c, list))
ggplot(df2, aes(x = Values)) +
  geom_histogram(binwidth = 1, center = 0)
Created on 2021-02-08 by the reprex package (v1.0.0)
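For what it's worth, do.call(c, list) and unlist(list) produce the same flat vector here (the sequences are unnamed), so the two flattening idioms are interchangeable. A quick check, assuming the list object built above:
identical(do.call(c, list), unlist(list, use.names = FALSE))
# [1] TRUE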

Something is off in how the list gets reshaped, and you can check it using table. To put all the list's numbers into a single vector I'd use unlist.
df <- data.frame(C1 = floor(runif(80, 0, 200)), C2 = floor(runif(80, -200, 0)))
list <- mapply(":", df$C2, df$C1)
df3 <- do.call(rbind.data.frame, list)
sdf3 <- stack(df3)
sdf4 <- data.frame(Values = sdf3$values)
table(sdf4)
# This returns the count of each unique value and some go up to 200,
# notably the limits of my unif distribution
If you use unlist, it gives the desired result.
df <- data.frame(C1 = floor(runif(80, 0, 200)), C2 = floor(runif(80, -200, 0)))
list <- mapply(":", df$C2, df$C1)
vec <- data.frame(Values = unlist(list))
a <- ggplot(vec, aes(x = Values)) +
  geom_histogram(binwidth = 1, center = 0)
The problem is not stack itself: by the time stack runs, do.call(rbind.data.frame, list) has already recycled the unequal-length sequences into a rectangular data frame, duplicating values.
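A quick sanity check (a sketch, reusing the list object from above): the number of values we expect is the sum of the individual sequence lengths, and unlist preserves exactly that, while the rbind/stack route inflates it.
expected <- sum(lengths(list))       # one entry per integer in every sequence
length(unlist(list)) == expected     # TRUE: unlist keeps each value exactly once
nrow(stack(do.call(rbind.data.frame, list))) == expected
# typically FALSE: recycling has padded the shorter sequences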

Index and assign multiple sets of rows at once

I have an imported dataframe Measurements that contains many observations from an experiment.
Measurements <- data.frame(X = 1:4,
                           Data = c(90, 85, 100, 105))
X Data
1 90
2 85
3 100
4 105
I want to add another column Condition that specifies the treatment group for each datapoint. I know which observation ranges are from which condition (e.g. observations 1:2 are from the control and observations 3:4 are from the experimental group).
I have already devised two solutions that give the desired output, but neither is ideal. First:
Measurements["Condition"] <- c(rep("Cont", 2), rep("Exp", 2))
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
The benefit of this is that it is one line of code/one command. But it is not ideal, since I need to do the arithmetic separately (e.g. 3:4 = 2 obs), which can be tricky, unclear, and indirect with larger datasets and more conditions (e.g. 47:83 = ? obs), and it is liable to perpetuate errors: a small error in the length of an early assignment also shifts the assignment of later groups (e.g. if the rep count for Cont is mistakenly 1, then Exp gets mistakenly assigned to rows 2:3 too).
I also thought of assigning like this, which gives the desired output too:
Measurements[1:2, "Condition"] <- "Cont"
Measurements[3:4, "Condition"] <- "Exp"
X Data Condition
1 90 Cont
2 85 Cont
3 100 Exp
4 105 Exp
This makes it more clear/simple/direct which rows will receive which assignment, but this requires separate assignments and repetition. I feel like there should be a way to "vectorize" this assignment, which is the solution I'm looking for.
I'm having trouble finding rules for complex indexing online. Here is my first intuitive guess at how to achieve this:
Measurements[c(1:2, 3:4), "Condition"] <- list("Cont", "Exp")
X Data Condition
1 90 Cont
2 85 Cont
3 100 Cont
4 105 Cont
But this doesn't work. It seems to combine 1:2 and 3:4 into a single equivalent range (1:4) and assigns only the first condition to this range, which suggests I also need to specify the column again. When I try to specify the column again:
Measurements[c(1:2, 3:4), c("Condition", "Condition")] <- list("Cont", "Exp")
X Data Condition Condition.1
1 90 Cont Exp
2 85 Cont Exp
3 100 Cont Exp
4 105 Cont Exp
For some reason this creates a second new column (??), and it again seems to combine 1:2 and 3:4 into essentially 1:4. So I think I need to index the two row ranges in a way that keeps them separate and only specify the column once, but I'm stuck on how to do this. I assume the solution is simple but I can't seem to find an example of what I'm trying to do. Maybe to keep them separate I do have to assign them separately, but I'm hoping there is a way.
Can anyone help? Thank you a ton in advance from an R noobie!
If you already have a list of observations which belong to each condition you could use dplyr::case_when to do a conditional mutate. Depending on how you have this information stored you could use something like the following:
library(dplyr)
Measurements <- data.frame(X = 1:4,
                           Data = c(90, 85, 100, 105))
# set which observations belong to each condition
Cont <- 1:2
Exp <- 3:4
Measurements %>%
  mutate(Condition = case_when(
    X %in% Cont ~ "Cont",
    X %in% Exp ~ "Exp"
  ))
# X Data Condition
# 1 90 Cont
# 2 85 Cont
# 3 100 Exp
# 4 105 Exp
Note that this does not require the observations to be in consecutive rows.
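A base-R equivalent of the same idea (a sketch, assuming every row index appears in exactly one of the condition vectors): put the index vectors in a named list and expand the names into one label per row.
idx <- list(Cont = Cont, Exp = Exp)
Measurements$Condition <- NA_character_       # initialize the column
Measurements$Condition[unlist(idx)] <- rep(names(idx), lengths(idx))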
I normally see this done with a merge operation. The trick is getting your conditions data into a nice shape.
composeConditions <- function(...) {
  conditions <- list(...)
  data.frame(
    X = unname(unlist(conditions)),
    condition = unlist(unname(lapply(
      names(conditions),
      function(x) rep(x, times = length(conditions[x][[1]]))
    )))
  )
}
conditions <- composeConditions(Cont = 1:2, Exp = 3:4)
> conditions
X condition
1 1 Cont
2 2 Cont
3 3 Exp
4 4 Exp
merge(Measurements, conditions, by = "X")
X Data condition
1 1 90 Cont
2 2 85 Cont
3 3 100 Exp
4 4 105 Exp
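If the row order of Measurements matters (merge sorts the result by the join column by default), a match()-based lookup does the same join while preserving the original order. A sketch using the conditions frame from above:
Measurements$Condition <- conditions$condition[match(Measurements$X, conditions$X)]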
An efficient approach for larger datasets is to encode the group pattern as an integer index into a vector of labels.
Measurements <- data.frame(X = 1:4, Data = c(90, 85, 100, 105))
dat <- c("Cont","Exp")
pattern <- c(1,1,2,2)
Or derive the pattern from the data, e.g. conditionally on Measurements$Data:
pattern <- ifelse(Measurements$Data >= 100, 2, 1)
# [1] 1 1 2 2
Then you can add the data simply by doing:
Measurements$Condition <- dat[pattern]
# X Data Condition
#1 1 90 Cont
#2 2 85 Cont
#3 3 100 Exp
#4 4 105 Exp
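When the pattern really is derivable from the data, the lookup collapses to one line (a sketch of the same idea): the comparison yields FALSE/TRUE, and adding 1 turns that into indices 1/2 into dat.
Measurements$Condition <- dat[(Measurements$Data >= 100) + 1]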

Faster Way to Create a Subset within a Loop or Apply Function in R

I'm new to R, so apologies in advance for bad form in my code.
I'm trying to figure out the best way to go through a dataframe, row by row, and modify a value based on logic that references other columns within that row or an entirely different dataframe. The issue is that the logic I'm using necessitates creating and subsetting a dataframe for each row to retrieve a minimum value. My real data set is 47000 rows and 15 columns, so creating 47,000 subsets is taking a long time.
Here are sample datasets to help describe what I'm talking about.
df1 <- data.frame('A' = c(rep("Beer", 2), rep("Chip", 2)),
                  'B' = c(NA, 3, NA, 9), 'C' = 5:8, 'D' = NA)
df2 <- data.frame('Q' = c(rep("Beer", 2), rep("Chip", 2)), 'R' = 6:9,
                  'S' = c(12, 15, 4, 18), 'T' = c(23, 45, 75, 34))
df1:
A B C D
Beer NA 5 NA
Beer 3 6 NA
Chip NA 7 NA
Chip 9 8 NA
df2:
Q R S T
Beer 6 12 23
Beer 7 15 45
Chip 8 4 75
Chip 9 18 34
This loop does what I want: it checks whether the value in column B is NA or not; if it isn't, it uses that value for column D, and if it is NA, it retrieves the minimum value from a filtered subset of df2. In the real use case I have other filtering conditions.
require(dplyr)
for (i in 1:nrow(df1)) {
  if (!is.na(df1$B[i])) {
    df1$D[i] <- df1$B[i]
  } else {
    x <- filter(df2, df1$A[i] == df2$Q)
    x <- min(x$S)
    df1$D[i] <- x
  }
}
Everyone says to avoid loops in R, so I created this function using apply, which also works (although it is a little more difficult to follow):
FUNC <- function(x) {
  apply(x, 1, function(y) {
    if (!is.na(y[2])) {
      y[4] <- y[2]
    } else {
      z <- filter(df2, y[1] == df2$Q)
      z <- min(z$S)
      y[4] <- z
    }
  })
}
df1$D <- as.numeric(FUNC(df1))
Output:
A B C D
Beer NA 5 12
Beer 3 6 3
Chip NA 7 4
Chip 9 8 9
Aside question: is there a way to reference items in vector y by name instead of by index position?
So is there a better way to do this? Right now both methods take about 5-8 minutes to run through 47,000+ rows which seems long to me.
df1$D <- df2 %>%
  rename(A = Q) %>%                          # align the join key with df1
  group_by(A) %>%
  summarise(D = min(S)) %>%                  # per-group minimum of S
  right_join(df1, by = "A") %>%              # one row per row of df1
  mutate(D = ifelse(is.na(B), D.x, B)) %>%   # D.x is the joined minimum; df1's own (all-NA) D becomes D.y
  `[[`("D")
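The same vectorized idea in base R (a sketch, assuming df1 and df2 as defined in the question): compute the per-group minimum of S once, then look it up by name.
mins <- tapply(df2$S, df2$Q, min)            # named vector: Beer = 12, Chip = 4
df1$D <- ifelse(is.na(df1$B), mins[as.character(df1$A)], df1$B)
As to the aside question: apply(x, 1, ...) passes each row as a named vector, so y["B"] works in place of y[2]; be aware, though, that apply coerces a mixed-type data frame to a character matrix first.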

R: Summing rows in a loop based on rowname

I am new to R and would find some tips very helpful.
I have a populated matrix X whose columns hold numeric row indices.
These correspond to rows of a second matrix Y.
I would like to sum rows of matrix Y based on the indices stored in matrix X.
So X[,1] contains a set of row indices, and I want the sum over those particular rows of matrix Y.
I think where I'm having difficulty is where to put the rownames() in the statements - I've tried many different combinations using functions, with and if. Any guidance or tips would be very gratefully received. Thank you.
I have provided a simplified version of the problem below:
X         Y
1 2       10 10 10
3 3       20 20 20
5 4       30 30 30
          40 40 40
          50 50 50
Z[1] (X[,1]) should equal [10+10+10]+[30+30+30]+[50+50+50]
Z[2] (X[,2]) should equal [20+20+20]+[30+30+30]+[40+40+40]
Z should be a vector of row sums of Y, one entry per column of X's index values.
You can achieve this as follows:
x <- data.frame(X)
sapply(x, function(r) sum(Y[r, ]))
Output is:
X1 X2
270 270
Alternatively, you could name the columns of matrix X and supply them to sapply; in this case, I went with the easy conversion of X to a data frame.
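Equivalently, you can skip the data frame conversion and iterate over the matrix columns directly (a sketch assuming X and Y as above):
apply(X, 2, function(r) sum(Y[r, ]))
# [1] 270 270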
A solution based on data.table and reshape2 packages:
library(data.table)
library(reshape2)
X <- matrix(c(1,3,5,2,3,4), nrow = 3, ncol = 2)
Y <- 10*matrix(rep(1:5, each = 3), nrow = 5, byrow = TRUE)
# Convert to data.table
X.DT <- data.table(X)
Y.DT <- data.table(Y)
Z.DT <-
  # First melt X to get the column names as grouping 'variable'
  # and the numeric values in 'value'
  melt(X.DT, measure.vars = names(X.DT))[
    # Sum the values of Y selected by the indices stored in X
    , .(Z = sum(Y.DT[value]))
    , by = variable
  ]
Z.DT
Result looks like this:
variable Z
1: V1 270
2: V2 270
And if you need the result as a simple vector Z then you can do it like this:
Z <- Z.DT[,Z]
Z
[1] 270 270
For reference, the intermediary data.table that is returned by the melt function looks like this:
> melt(X.DT, measure.vars = names(X.DT))
variable value
1: V1 1
2: V1 3
3: V1 5
4: V2 2
5: V2 3
6: V2 4

Expand two large data files and apply using data.table?

I am attempting to apply a function to two data sets, df1 and df2, where df1 contains (a, b) and can be 1 million rows long, and df2 contains (x, y, z) and can also be large, anywhere from ~100 to >10,000 rows. I would like to apply a function foo over every combination of both data sets and then sum over the second data set.
foo <- function(a, b, x, y, z) a + b + x + y + z
df1 <- data.frame(a = 1:10, b = 11:20)
df2 <- data.frame(x= 1:5, y = 21:25, z = 31:35)
The code I am using to apply this function is taken from @jlhoward's answer to "How to avoid multiple loops with multiple variables in R":
foo.new <- function(p1, p2) {
  p1 <- as.list(p1); p2 <- as.list(p2)
  foo(p1$a, p1$b, p2$x, p2$y, p2$z)
}
indx <- expand.grid(indx2 = seq(nrow(df2)), indx1 = seq(nrow(df1)))
result <- with(indx, foo.new(df1[indx1, ], df2[indx2, ]))
sums <- aggregate(result, by = list(rep(seq(nrow(df1)), each = nrow(df2))), sum)
However, as df2 gets large (>1000 rows), I quickly run out of memory at the result step above (on a 64-bit PC with 32 GB of RAM).
I have read about data.table quite a bit but can't tell whether it offers anything that would help save memory here: something to replace with() and produce a smaller object at the result step, or to replace expand.grid() at the index step, which creates the largest object by far.
Here is a data.table solution that should be pretty fast:
library(data.table)
# CJ is data.table's cross-join, the analogue of expand.grid;
# note indx1 indexes df1 and indx2 indexes df2
indx <- CJ(indx1 = seq(nrow(df1)), indx2 = seq(nrow(df2)))
indx[, `:=`(result = foo.new(df1[indx1, ], df2[indx2, ]),
            Group.1 = rep(seq(nrow(df1)), each = nrow(df2)))][
  , .(sums = sum(result)), by = Group.1]
Group.1 sums
1: 1 355
2: 2 365
3: 3 375
4: 4 385
5: 5 395
6: 6 405
7: 7 415
8: 8 425
9: 9 435
10: 10 445
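If even the index grid is too large to materialize, note that foo is vectorized in its arguments, so you can reduce each df1 row immediately without building the grid at all. A base-R sketch:
sums <- vapply(seq_len(nrow(df1)), function(i) {
  sum(foo(df1$a[i], df1$b[i], df2$x, df2$y, df2$z))
}, numeric(1))
# peak memory is O(nrow(df2)) per iteration instead of O(nrow(df1) * nrow(df2))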

Aggregate over categories that contain NAs with ddply and lapply?

I would like to aggregate a data.frame over 3 categories, with one of them varying. Unfortunately this one varying category contains NAs (actually it's the reason why it needs to vary). Thus I created a list of data.frames. Every data.frame within this list contains only complete cases with respect to three variables (with only one of them changing).
Let's reproduce this:
library(plyr)
mydata <- warpbreaks
names(mydata) <- c("someValue","group","size")
mydata$category <- c(1,2,3)
mydata$categoryA <- c("A","A","X","X","Z","Z")
# add some NA
mydata$category[c(8,10,19)] <- NA
mydata$categoryA[c(14,1,20)] <- NA
# create a list of TRUE/FALSE vectors
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
testTF <- lapply(mydata[,c("category","categoryA")],noNAList)
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
# check x and see that it may contain NAs as long
# as it's not in one of the 3 categories I want to aggregate over
x <-lapply(testTF,selectDF)
## let ddply get to work
doddply <- function(df){
  ddply(df, .(group, size), summarize, sumTest = sum(someValue))
}
y <- lapply(x, doddply);y
y comes very close to what I want to get
$category
group size sumTest
1 A L 375
2 A M 198
3 A H 185
4 B L 254
5 B M 259
6 B H 169
$categoryA
group size sumTest
1 A L 375
2 A M 204
3 A H 200
4 B L 254
5 B M 259
6 B H 169
But I need to implement aggregation over a third varying variable, which is in this case category and categoryA. Just like:
group size category sumTest sumTestTotal
1 A H 1 46 221
2 A H 2 46 221
3 A H 3 93 221
and so forth. How can I add names(x) to lapply, or do I need a loop or environment here?
EDIT:
Note that I want EITHER category OR categoryA added to the mix. In reality I have about 15 mutually exclusive categorical vars.
I think you might be making this really hard on yourself, if I understand your question correctly.
If you want to aggregate the data.frame 'myData' by three (or four) variables, you would simply do this:
aggregate(someValue ~ group + size + category + categoryA, sum, data=mydata)
group size category categoryA someValue
1 A L 1 A 51
2 B L 1 A 19
3 A M 1 A 17
4 B M 1 A 63
aggregate will automatically remove rows that include NA in any of the grouping categories. If someValue is sometimes NA, you can add the parameter na.rm=TRUE, which is passed through to the aggregation function.
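For example (a sketch), the extra argument is passed straight through to sum:
aggregate(someValue ~ group + size + category, sum, data = mydata, na.rm = TRUE)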
I also noted that you put a lot of unnecessary code into functions. For example:
# create a list of data.frames
selectDF <- function(TFvec){
  res <- mydata[TFvec,]
  return(res)
}
Can be written like:
selectDF <- function(TFvec) mydata[TFvec,]
Also, using lapply to create a list of two data frames without the NA is overkill. Try this code:
x = list(mydata[!is.na(mydata$category),],mydata[!is.na(mydata$categoryA),])
I know the question explicitly requests a ddply()/lapply() solution.
But ... if you are willing to come on over to the dark side, here is a data.table()-based function that should do the trick:
# Convert mydata to a data.table
library(data.table)
dt <- data.table(mydata, key = c("group", "size"))
# Define workhorse function
myfunction <- function(dt, VAR) {
  E <- as.name(substitute(VAR))
  dt[i = !is.na(eval(E)),
     j = {n <- sum(.SD[, someValue])
          .SD[, list(sumTest = sum(someValue),
                     sumTestTotal = n,
                     share = sum(someValue)/n),
              by = VAR]},
     by = key(dt)]
}
# Test it out
s1 <- myfunction(dt, "category")
s2 <- myfunction(dt, "categoryA")
ADDED ON EDIT
Here's how you could run this for a vector of different categorical variables:
catVars <- c("category", "categoryA")
ll <- lapply(catVars,
             FUN = function(X) {
               do.call(myfunction, list(dt, X))
             })
names(ll) <- catVars
lapply(ll, head, 3)
# $category
# group size category sumTest sumTestTotal share
# [1,] A H 2 46 185 0.2486486
# [2,] A H 3 93 185 0.5027027
# [3,] A H 1 46 185 0.2486486
#
# $categoryA
# group size categoryA sumTest sumTestTotal share
# [1,] A H A 79 200 0.395
# [2,] A H X 68 200 0.340
# [3,] A H Z 53 200 0.265
Finally, I found a solution that might not be as slick as Josh's, but it works without any dark forces (data.table). You may laugh: here's my reproducible example, using the same sample data as in the question.
qual <- c("category","categoryA")
# get T / F vectors
noNAList <- function(vec){
  res <- !is.na(vec)
  return(res)
}
selectDF <- function(TFvec) mydata[TFvec,]
NAcheck <- lapply(mydata[,qual], noNAList)
# create a list of data.frames
listOfDf <- lapply(NAcheck, selectDF)
workhorse <- function(charVec, listOfDf){
  dfs <- list2env(listOfDf, envir = new.env())
  # create one ddply expression per categorical variable
  exlist <- list()
  for(i in 1:length(charVec)){
    exlist[[charVec[i]]] <- parse(text = paste(
      "ddply(", charVec[i],
      ",.(group,size,", charVec[i], "),summarize,sumTest = sum(someValue))",
      sep = ""))
  }
  res <- lapply(exlist, eval, envir = dfs)
  return(res)
}
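It would then be called like this (a sketch), giving one ddply summary per categorical variable:
res <- workhorse(qual, listOfDf)
res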
Is this more like what you mean? I find your example extremely difficult to understand. In the code below, the function takes any column name and aggregates by it; it can return multiple aggregation functions of someValue. I then find all the column names you would like to aggregate by and apply the function to that vector.
# Build a method to aggregate by column.
agg.by.col <- function(column) {
  by.list <- list(mydata$group, mydata$size, mydata[, column])
  names(by.list) <- c('group', 'size', column)
  aggregate(mydata$someValue, by = by.list,
            function(x) c(sum = sum(x), mean = mean(x)))
}
# Find all the column names you want to aggregate by
cols = names(mydata)[!(names(mydata) %in% c('someValue','group','size'))]
# Apply the method to each column name.
lapply(cols, agg.by.col)
