How can I get column data to be added based on a group designation using R?

The data set I'm working with is similar to the one below (the example is much smaller; the real data has tens of thousands of rows). I haven't been able to figure out how to get R to add up column data based on the group number. Essentially, I want the number of greens, blues, and reds summed separately for group 81 and for group 66, and then to use those sums to calculate percentages.
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(textConnection(txt), sep = " ", header = TRUE)
I've spent a good deal of time trying to figure out how to use some of the functions on my own hoping I would stumble across a proper way to do it, but since I'm such a new basic user I feel like I have hit a wall that I cannot progress past without help.

One way is via aggregate. Assuming your data is in an object x:
aggregate(. ~ Group, data=x, FUN=sum)
# Group Green Blue Red Total
# 1 66 46 48 13 107
# 2 81 71 30 33 134
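Since the question also asks for percentages, here is a minimal sketch that builds on the aggregate call above. It assumes the sample data is read into dat as in the question; the names sums, colours and pct are illustrative and not part of the original answer.

```r
# Sample data from the question
txt <- "Group Green Blue Red Total
81 15 10 21 46
81 10 10 10 30
81 4 8 0 12
81 42 2 2 46
66 11 9 1 21
66 5 14 5 24
66 7 5 2 14
66 1 16 3 20
66 22 4 2 28"
dat <- read.table(text = txt, header = TRUE)

# Column sums per group, as in the aggregate answer
sums <- aggregate(. ~ Group, data = dat, FUN = sum)

# Share of each colour within its group's total, in percent
colours <- c("Green", "Blue", "Red")
pct <- cbind(Group = sums$Group, 100 * sums[colours] / sums$Total)
print(round(pct, 1))
```

Dividing by sums$Total reuses the Total column from the data; if that column were missing, rowSums(sums[colours]) would give the same denominator.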

Both of the answers above are perfect examples of how to address this type of problem. Two other options exist within reshape and plyr
library(reshape)
cast(melt(dat, "Group"), Group ~ ..., sum)
library(plyr)
ddply(dat, "Group", function(x) colSums(x[, -1]))

I would suggest that @Joshua's answer is neater, but two functions you should learn are apply and tapply. If a is your data set, then:
## apply calculates the sum of each row
> total = apply(a[,2:4], 1, sum)
## tapply calculates the sum based on each group
> tapply(total, a$Group, sum)
66 81
107 134
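tapply works on one vector at a time, so the call above only gives the grand total per group. As a rough sketch, per-colour sums can be obtained by running tapply over each colour column with sapply. The data frame a is rebuilt here from the question's sample so the block runs on its own; by_colour is an illustrative name.

```r
# Sample data from the question, without the Total column
a <- data.frame(
  Group = c(81, 81, 81, 81, 66, 66, 66, 66, 66),
  Green = c(15, 10, 4, 42, 11, 5, 7, 1, 22),
  Blue  = c(10, 10, 8, 2, 9, 14, 5, 16, 4),
  Red   = c(21, 10, 0, 2, 1, 5, 2, 3, 2)
)

# One tapply per colour column; sapply binds the results into a matrix
# with one row per group and one column per colour
by_colour <- sapply(a[c("Green", "Blue", "Red")],
                    function(col) tapply(col, a$Group, sum))
print(by_colour)
```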

Related

multiplying columns in R

I have a data frame like this.
> abc
ID 1.x 2.x 1.y 2.y
1 4 10 20 30 40
2 16 5 10 5 10
3 42 16 17 18 19
4 91 20 20 20 20
5 103 103 42 56 84
How do I create two additional columns '1' and '2' by multiplying 1.x * 1.y and 2.x * 2.y in a generalized way?
I am trying to get a generalized solution where the number of columns may be very large. So I want to multiply each n.x column by the matching n.y column. While the x and y suffixes are fixed, n has to be figured out from the data frame.
For simplicity, let's assume n is also fixed, although it is a large number.
One thing I can try is:
abc[,c(6,7)]=abc[,c(2,3)]*abc[,c(4,5)]
It will work only if the column positions are contiguous. This is good enough for me, but if anyone has a more generalized solution, it will benefit us all.
If there are only a couple of variables to multiply, we can do this with transform by multiplying the columns of interest
transform(abc, new1 = `1.x`*`1.y`, new2 = `2.x`*`2.y`, check.names = FALSE)
# ID 1.x 2.x 1.y 2.y new1 new2
#1 4 10 20 30 40 300 800
#2 16 5 10 5 10 25 100
#3 42 16 17 18 19 288 323
#4 91 20 20 20 20 400 400
#5 103 103 42 56 84 5768 3528
If we have lots of columns, then one approach is to split the dataset into a list of data.frames by taking the prefix of the column names, then loop through the list and multiply the columns of each piece with do.call
abc[paste0("new", 1:2)] <- lapply(
  split.default(abc[-1], sub("\\.[a-z]+$", "", names(abc)[-1])),
  function(x) do.call(`*`, x))
Or another option is (based on the pairwise column multiplication)
apply(aperm(array(unlist(abc[-1]), c(5, 2, 2)),
c(3, 1, 2)), 3, matrixStats::colProds)
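For the case where n has to be discovered from the data frame, a minimal base R sketch is to pair columns by name rather than by position. It assumes every n.x column has a matching n.y column; the helper names x_cols, y_cols and new_cols are illustrative only.

```r
# Sample data from the question; check.names = FALSE keeps names like "1.x"
abc <- data.frame(
  ID    = c(4, 16, 42, 91, 103),
  `1.x` = c(10, 5, 16, 20, 103),
  `2.x` = c(20, 10, 17, 20, 42),
  `1.y` = c(30, 5, 18, 20, 56),
  `2.y` = c(40, 10, 19, 20, 84),
  check.names = FALSE
)

# Find every ".x" column and derive the name of its ".y" partner
x_cols <- grep("\\.x$", names(abc), value = TRUE)
y_cols <- sub("\\.x$", ".y", x_cols)
stopifnot(all(y_cols %in% names(abc)))  # assumption: each x has a y

# New columns are named after the shared prefix ("1", "2", ...)
new_cols <- sub("\\.x$", "", x_cols)
abc[new_cols] <- unname(Map(`*`, abc[x_cols], abc[y_cols]))
print(abc)
```

Because the pairing is done by name, the columns do not need to be contiguous or in any particular order.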
mutate() preserves the original variables and appends the new ones. mutate_all() applies a function to every column in your data frame (note that funs() is deprecated in recent dplyr releases).
library(dplyr)
abc %>%
  mutate(new_vary1 = `1.x` * `1.y`,
         new_vary2 = `2.x` * `2.y`) %>%
  mutate_all(funs(. * `1.x`))

use dplyr mutate() in programming

I am trying to assign a column name to a variable using mutate.
df <-data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
df%>%mutate(name = ifelse(x <50, "small", "big"))
}
When I run
new(name = "newVar")
it doesn't work. I know mutate_() could help but I'm struggling in using it together with ifelse.
Any help would be appreciated.
Using dplyr 0.7.1 and its advances in NSE, you have to UQ the argument to mutate and then use := when assigning. There is lots of info on programming with dplyr and NSE here: https://cran.r-project.org/web/packages/dplyr/vignettes/programming.html
I've changed the name of the function argument to myvar to avoid confusion. You could also use case_when from dplyr instead of ifelse if you have more categories to recode.
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(myvar){
df %>% mutate(UQ(myvar) := ifelse(x < 50, "small", "big"))
}
new(myvar = "newVar")
This returns
x y newVar
1 37 1.82669 small
2 63 -0.04333 big
3 46 0.20748 small
4 93 0.94169 big
5 83 -0.15678 big
6 14 -1.43567 small
7 61 0.35173 big
8 26 -0.71826 small
9 21 1.09237 small
10 90 1.99185 big
11 60 -1.01408 big
12 70 0.87534 big
13 55 0.85325 big
14 38 1.70972 small
15 6 0.74836 small
16 23 -0.08528 small
17 27 2.02613 small
18 76 -0.45648 big
19 97 1.20124 big
20 99 -0.34930 big
21 74 1.77341 big
22 72 -0.32862 big
23 64 -0.07994 big
24 53 -0.40116 big
25 16 -0.70226 small
26 8 0.78965 small
27 34 0.01871 small
28 24 1.95154 small
29 82 -0.70616 big
30 77 -0.40387 big
31 43 -0.88383 small
32 88 -0.21862 big
33 45 0.53409 small
34 29 -2.29234 small
35 54 1.00730 big
36 22 -0.62636 small
37 100 0.75193 big
38 52 -0.41389 big
39 36 0.19817 small
40 89 -0.49224 big
41 81 -1.51998 big
42 18 0.57047 small
43 78 -0.44445 big
44 49 -0.08845 small
45 20 0.14014 small
46 32 0.48094 small
47 1 -0.12224 small
48 66 0.48769 big
49 11 -0.49005 small
50 87 -0.25517 big
Following the dplyr programming vignette, define your function as follows:
new <- function(name)
{
nn <- enquo(name) %>% quo_name()
df %>% mutate( !!nn := ifelse(x <50, "small", "big"))
}
enquo takes its expression argument and quotes it, followed by quo_name converting it into a string. Since nn is now quoted, we need to tell mutate not to quote it a second time. That's what !! is for. Finally, := is a helper operator to make it valid R code. Note that with this definition, you can simply pass newVar instead of "newVar" to your function, maintaining dplyr style.
> new( newVar ) %>% head
x y newVar
1 94 -1.07642088 big
2 85 0.68746266 big
3 80 0.02630903 big
4 74 0.18323506 big
5 86 0.85086915 big
6 38 0.41882858 small
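If the column name always arrives as a character string, the enquo() step can be skipped and the string unquoted directly on the left of :=. The following is only a sketch, assuming dplyr 0.7 or later; add_size_column and its data argument are hypothetical names, and set.seed is there only to make the run reproducible.

```r
library(dplyr)

set.seed(1)  # only for reproducibility of this sketch
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))

# `name` is a character string; !! unquotes it on the left-hand side of :=
add_size_column <- function(data, name) {
  data %>% mutate(!!name := ifelse(x < 50, "small", "big"))
}

out <- add_size_column(df, "newVar")
head(out)
```

Passing the data frame as an argument, rather than reading df from the enclosing environment, keeps the function usable on other data.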
Base R solution
df <- data.frame(x = sample(1:100, 50), y = rnorm(50))
new <- function(name){
  # same rule as in the question: x < 50 is "small", everything else is "big"
  df[[name]] <- ifelse(df$x < 50, "small", "big")
  return(df)
}
I am using dplyr 0.5, so I just combine base R with mutate:
new <- function(Name){
  df <- mutate(df, ifelse(x < 50, "small", "big"))
  names(df)[ncol(df)] <- Name  # rename the column mutate() just appended
  return(df)
}
new("newVar")

Looping through rows, creating and reusing multiple variables

I am building a streambed hydrology calculator in R using multiple tables from an Access database. I am having trouble automating and calculating the same set of indices for multiple sites. The following sample dataset describes my data structure:
> Thalweg
StationID AB0 AB1 AB2 AB3 AB4 AB5 BC1 BC2 BC3 BC4 Xdep_Vdep
1 1AAUA017.60 47 45 44 55 54 6 15 39 15 11 18.29
2 1AXKR000.77 30 27 24 19 20 18 9 12 21 13 6.46
3 2-BGU005.95 52 67 62 42 28 25 23 26 11 19 20.18
4 2-BLG011.41 66 85 77 83 63 35 10 70 95 90 67.64
5 2-CSR003.94 29 35 46 14 19 14 13 13 21 48 6.74
where each column represents certain field-measured parameters (i.e. depth of a reach section) and each row represents a different site.
I have successfully used the apply functions to simultaneously calculate simple functions on multiple rows:
> Xdepth <- apply(Thalweg[, 2:11], 1, mean) # Mean Depth
> Xdepth
1 2 3 4 5
33.1 19.3 35.5 67.4 25.2
and appending the results back to the proper station in a dataframe.
However, I am struggling when I want to calculate and save variables that are subsequently used for further calculations. I cannot seem to loop or apply the same function to multiple columns on a single row and complete the same calculations over the next row without mixing variables and data.
I want to do:
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + other_variables), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + other_variables), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + other_variables), Thalweg$AB3)
# etc.
Depth_AB0 <- (Thalweg$AB0 - Residual_AB0)
Depth_AB1 <- (Thalweg$AB1 - Residual_AB1)
Depth_AB2 <- (Thalweg$AB2 - Residual_AB2)
# etc.
I have tried and subsequently failed at for loops such as:
for (i in nrow(Thalweg)){
Residual_AB0 <- min(Xdep_Vdep, Thalweg$AB0)
Residual_AB1 <- min((Residual_AB0 + Stacks_Equation), Thalweg$AB1)
Residual_AB2 <- min((Residual_AB1 + Stacks_Equation), Thalweg$AB2)
Residual_AB3 <- min((Residual_AB2 + Stacks_Equation), Thalweg$AB3)
Residuals <- data.frame(Thalweg$StationID, Residual_AB0, Residual_AB1, Residual_AB2, Residual_AB3)
}
Is there a better way to approach looping through multiple lines of data when I need unique variables saved for each specific row that I am currently calculating? Thank you for any suggestions.
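Your exact problem is still a mystery to me, but it looks like you want a double for loop. Note that R is case-sensitive, so the data frame must be referred to as Thalweg, and the result has to be stored for each row:
Residual <- numeric(nrow(Thalweg))        # one result per site
for (i in seq_len(nrow(Thalweg))) {
  residual <- Thalweg[i, "Xdep_Vdep"]
  for (j in 2:11) {                       # columns AB0 to BC4
    residual <- min(residual, Thalweg[i, j])
  }
  Residual[i] <- residual                 # keep the value for row i
}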
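The chained calculation in the question (each residual depends on the previous one) can also be sketched without an explicit loop over rows, by looping over columns and letting pmin work on all sites at once. This is a sketch under assumptions: only columns AB0 to AB3 are used, and increment is a hypothetical stand-in for the undefined other_variables / Stacks_Equation term.

```r
# Subset of the sample data from the question
Thalweg <- data.frame(
  StationID = c("1AAUA017.60", "1AXKR000.77", "2-BGU005.95",
                "2-BLG011.41", "2-CSR003.94"),
  AB0 = c(47, 30, 52, 66, 29),
  AB1 = c(45, 27, 67, 85, 35),
  AB2 = c(44, 24, 62, 77, 46),
  AB3 = c(55, 19, 42, 83, 14),
  Xdep_Vdep = c(18.29, 6.46, 20.18, 67.64, 6.74)
)

increment <- 1  # HYPOTHETICAL placeholder for "other_variables"
ab_cols <- c("AB0", "AB1", "AB2", "AB3")

# One row per site, one column per residual
Residual <- matrix(NA_real_, nrow(Thalweg), length(ab_cols),
                   dimnames = list(as.character(Thalweg$StationID),
                                   paste0("Residual_", ab_cols)))

# First residual: row-wise minimum of Xdep_Vdep and AB0
Residual[, 1] <- pmin(Thalweg$Xdep_Vdep, Thalweg[[ab_cols[1]]])

# Each later residual builds on the previous column, for all sites at once
for (k in seq_along(ab_cols)[-1]) {
  Residual[, k] <- pmin(Residual[, k - 1] + increment, Thalweg[[ab_cols[k]]])
}

# Depth_ABk = ABk - Residual_ABk
Depth <- as.matrix(Thalweg[ab_cols]) - Residual
print(Residual)
```

pmin takes the element-wise minimum, which is what keeps the values of different sites from being mixed; min would collapse a whole column to a single number.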

R: How to divide a data frame by column values?

Suppose I have a data frame with 3 columns and 10 rows as follows.
# V1 V2 V3
# 10 24 92
# 13 73 100
# 25 91 120
# 32 62 95
# 15 43 110
# 28 54 84
# 30 56 71
# 20 82 80
# 23 19 30
# 12 64 89
I want to create sub-dataframes that divide the original by the values of V1.
For example,
the first data frame will have the rows with values of V1 from 10-14,
the second will have the rows with values of V1 from 15-19,
the third from 20-24, etc.
What would be the simplest way to make this?
So if this is your data
dd<-data.frame(
V1=c(10,13,25,32,15,28,30,20,23,12),
V2=c(24,73,91,62,43,54,56,82,19,64),
V3=c(92,100,120,95,110,84,71,80,30,89)
)
then the easiest way to split is using the split() command. And since you want to split in ranges, you can use the cut() command to create those ranges. A simple split can be done with
ss<-split(dd, cut(dd$V1, breaks=seq(10,35,by=5)-1)); ss
split returns a list where each item is the subsetted data.frame. So to get at the data.frame with the values for 10-14, use ss[[1]], and for 15-19, use ss[[2]] etc.
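As a small variation on the same idea, the bins can be made left-closed with right = FALSE, so that 10-14 becomes [10,15) and non-integer values are handled too. This is a sketch using the data from the question; bins is an illustrative name.

```r
# Data from the question
dd <- data.frame(
  V1 = c(10, 13, 25, 32, 15, 28, 30, 20, 23, 12),
  V2 = c(24, 73, 91, 62, 43, 54, 56, 82, 19, 64),
  V3 = c(92, 100, 120, 95, 110, 84, 71, 80, 30, 89)
)

# Left-closed bins: [10,15), [15,20), [20,25), [25,30), [30,35)
bins <- cut(dd$V1, breaks = seq(10, 35, by = 5), right = FALSE)

# One data frame per bin, named after the bin label
ss <- split(dd, bins)
names(ss)
ss[["[10,15)"]]
```

Values outside the break range get an NA bin and are dropped by split, so the breaks should cover the full range of V1.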

Cumulative count of values in R

I hope you are doing very well. I would like to know how to calculate the cumulative sum of a data set with certain conditions. A simplified version of my data set would look like:
t id
A 22
A 22
R 22
A 41
A 98
A 98
A 98
R 98
A 46
A 46
R 46
A 46
A 46
A 46
R 46
A 46
A 12
R 54
A 66
R 13
A 13
A 13
A 13
A 13
R 13
A 13
I would like to make a new data set where, for each value of "id", I have the cumulative number of times that id has appeared, but whenever t = R the count has to restart, e.g.
t id count
A 22 1
A 22 2
R 22 0
A 41 1
A 98 1
A 98 2
A 98 3
R 98 0
A 46 1
A 46 2
R 46 0
A 46 1
A 46 2
A 46 3
R 46 0
A 46 1
A 12 1
R 54 0
A 66 1
R 13 0
A 13 1
A 13 2
A 13 3
A 13 4
R 13 0
A 13 1
Any ideas as to how to do this? Thanks in advance.
Using rle:
out <- transform(df, count = sequence(rle(do.call(paste, df))$lengths))
out$count[out$t == "R"] <- 0
If your data.frame has more than these two columns and you want to check only these two, just replace df with df[, 1:2] or df[, c("t", "id")].
If you find do.call(paste, df) dangerous (as @flodel comments), then you can replace that with:
as.character(interaction(df))
I personally don't find anything dangerous or clumsy with this setup (as long as you have the right separator, meaning you know your data well). However, if you do find it as such, the second solution may help you.
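To see how the rle approach behaves, here is a self-contained run on the sample data from the question. The compact way of building df (strsplit and rep) and the intermediate names key and runs are only for this sketch.

```r
# Sample data from the question, written compactly
df <- data.frame(
  t  = strsplit("AARAAAARAARAAARAARARAAAARA", "")[[1]],
  id = rep(c(22, 41, 98, 46, 12, 54, 66, 13),
           times = c(3, 1, 4, 8, 1, 1, 1, 7))
)

# Paste t and id into one key per row, e.g. "A 22"
key <- do.call(paste, df)

# rle finds runs of identical consecutive keys;
# sequence() numbers the rows within each run: 1, 2, ..., length
runs <- rle(key)
out <- transform(df, count = sequence(runs$lengths))

# Rows with t == "R" restart the count, so they get 0
out$count[out$t == "R"] <- 0
print(out)
```

An "R" row always breaks a run because its key differs from the neighbouring "A" keys, which is why the count restarts after it.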
Update:
For those who don't like using do.call(paste, df) or as.character(interaction(df)) (please see the comment exchanges between me, @flodel and @HongOoi), here's another base solution:
idx <- which(df$t == "R")
ww <- NULL
if (length(idx) > 0) {
ww <- c(min(idx), diff(idx), nrow(df)-max(idx))
df <- transform(df, count = ave(id, rep(seq_along(ww), ww),
FUN=function(y) sequence(rle(y)$lengths)))
df$count[idx] <- 0
} else {
df$count <- sequence(rle(df$id)$lengths)  # no "R" rows: just count within each run of id
}
