sum up certain variables (columns) by variable names - r

I want to sum up certain variables (columns) in a data frame.
I would like to select those variables by parts of their names.
The complication is that I have several conditions at once, so a single contains() from dplyr does not work.
Here is an example:
ab_yy <- c(1:5)
bc_yy <- c(5:9)
cd_yy <- c(2:6)
de_xx <- c(3:7)
dat <- data.frame(ab_yy, bc_yy, cd_yy, de_xx)

  ab_yy bc_yy cd_yy de_xx
1     1     5     2     3
2     2     6     3     4
3     3     7     4     5
4     4     8     5     6
5     5     9     6     7
#sum up all variables that contain yy and certain extra conditions
#may look something like this: rowSums(select(dat, contains(("yy&ab")|("yy&bc")) ) )
desired result:
6 8 10 12 14

EDIT: Fixed, sorry, low on caffeine
If you want to use dplyr, try matches():
library(dplyr)
dat %>%
  select(matches("yy$")) %>%
  select(matches("^ab|^bc")) %>%
  rowSums()
[1] 6 8 10 12 14

I don't think it's the best way, but you can do it with grepl():
rowSums(dat[, grepl("ab.*yy|bc.*yy", colnames(dat))])
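For completeness, both conditions can also be folded into a single regular expression passed to matches(); a minimal sketch, assuming the column names always follow the prefix_suffix form shown above:

library(dplyr)
# one regex covers both conditions: names starting with ab or bc and ending in yy
unname(rowSums(select(dat, matches("^(ab|bc).*yy$"))))
#> [1]  6  8 10 12 14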

Related

mutate string into numeric, ignore alphabetical order of factor

I am trying to recode factor levels into numbers using the mutate function, but I want to ignore the alphabetical order in which the factor levels appear. Some levels occur multiple times, and I want each occurrence to be assigned, in the new column, the number given to the row in which that level first appeared in the data frame.
Example:
library(stringi)
library(dplyr)   # needed for mutate() below
set.seed(234)
data <- stri_rand_strings(20, 1)
data <- as.data.frame(data)
data2 <- data %>% mutate(num = as.numeric(factor(data)))
data2
Expected outcome:
dat <- data2[, -2]
order <- c(1, 2, 3, 2, 4, 5)
expected_result <- cbind.data.frame(head(dat), order)
expected_result
I think you can just create a new factor and set the levels as unique values of data2$data in your example:
new_fac <- factor(data2$data, levels = unique(data2$data))
The numeric values can be obtained:
new_order <- as.numeric(new_fac)
And this is what your final result would look like:
head(data.frame(new_fac, new_order))
  new_fac new_order
1       k         1
2       m         2
3       1         3
4       m         2
5       4         4
6       d         5
Or in your example with dplyr, you can do:
data %>%
  mutate(num = as.numeric(factor(data, levels = unique(data))))
You could accomplish this with a helper table that contains the row number of the first time a string appears in your table. I.e.
library(stringi)
library(tidyverse)
# generate data (same seed as in the question, so the output below matches)
set.seed(234)
data <- stri_rand_strings(20, 1)
data <- as.data.frame(data)
Create helper table:
factorlevels <- data %>% unique() %>% mutate(order = row_number())
... and inner join to data
data %>% inner_join(factorlevels)
Output:
> data %>% inner_join(factorlevels)
Joining, by = "data"
   data order
1     k     1
2     m     2
3     1     3
4     m     2
5     4     4
6     d     5
7     v     6
8     i     7
9     v     6
10    H     8
11    Y     9
12    X    10
13    a    11
14    a    11
15    0    12
16    R    13
17    J    14
18    j    15
19    8    16
20    s    17
I am sure that there is a one-liner approach to this, but I could not figure it out right away.
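For the record, one such one-liner does exist: match() against unique() returns, for each string, the index of its first appearance, which is exactly the desired numbering. A minimal sketch of that idea:

library(dplyr)
# match(x, unique(x)) numbers values by order of first appearance
data %>% mutate(order = match(data, unique(data)))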

Using Strings to Identify Sequence of Column Names in R

I am currently trying to use pre-defined strings to identify multiple column names in R.
To be more explicit, I am using the ave function to create identification variables for subgroups of a data frame. The twist is that I want the identification variables to be flexible, so that I can pass them as a generic character vector.
A sample code would be:
ids = with(df, ave(rep(1, nrow(df)), subcolumn1, subcolumn2, subcolumn3, FUN = seq_along))
I would like to run this code in the following fashion (code below does not work as expected):
subColumnsString = c("subcolumn1", "subcolumn2", "subcolumn3")
ids = with(df, ave(rep(1, nrow(df)), subColumnsString, FUN = seq_along))
I tried something with eval, but still did not work:
subColumnsString = c("subcolumn1", "subcolumn2", "subcolumn3")
ids = with(df, ave(rep(1, nrow(df)), eval(parse(text = subColumnsString)), FUN = seq_along))
Any ideas?
Thanks.
EDIT: Working code example of what I want:
df = mtcars
id_names = c("vs", "am")
idDF_correct = transform(df, idItem = as.numeric(interaction(vs, am)))
idDF_wrong = cbind(df, ave(rep(1, nrow(df)), df[id_names], FUN = seq_along))
Note how in idDF_correct, the unique combinations are correctly mapped into unique values of idItem. In idDF_wrong this is not the case.
I think this achieves what you requested. Here I use the mtcars dataset that ships with R:
subColumnsString <- c("cyl", "gear")
ids = with(mtcars, ave(rep(1, nrow(mtcars)), mtcars[subColumnsString], FUN = seq_along))
Just index your data.frame with the vector of column names; that returns a list, which works naturally with ave.
EDIT
ids = ave(rep(1, nrow(mtcars)), mtcars[subColumnsString], FUN = seq_along)
You can omit the with and just call plain ol' ave, as G. Grothendieck stated. You should also use their answer, as it is much more general.
This defines a function whose arguments are:
data, the input data frame
by, a character vector of column names in data
fun, a function to use in ave
Code:
Ave <- function(data, by, fun = seq_along) {
  do.call(function(...) ave(rep(1, nrow(data)), ..., FUN = fun), data[by])
}
# test
Ave(CO2, c("Plant", "Treatment"), seq_along)
giving:
[1] 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3
[39] 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6 7 1 2 3 4 5 6
[77] 7 1 2 3 4 5 6 7
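As a side note, the same grouping-by-strings can be written in a more recent dplyr idiom; a sketch assuming dplyr >= 1.0.0, where all_of() turns the character vector of column names into a tidyselect selection:

library(dplyr)
mtcars %>%
  group_by(across(all_of(subColumnsString))) %>%
  mutate(ids = row_number()) %>%   # per-group running index, like seq_along in ave
  ungroup()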

R: Transposing from long to wide and aggregating rows with matching ID

This is something I've been working around for a while by making separate data frames and doing full_join, but I think there's an easier way.
Overall, I want to calculate the differences between an individual ID's value from time 1 to time 2, by type, from a long-form data frame. This is one way I think I could do it, but if other people have other techniques or ideas I'd like to hear them too.
However, I'd also like to know how to address this transposing issue anyway because I'm curious.
Here's my issue.
I have a data frame in long form with 5 different measures for two different time periods. I want to convert it from long form to wide form, so that instead of a data frame that looks like this (note: not all types are included -- just 2 for the sake of length):
(example df1)
ID Time Value  Type
 1    1     7 Type1
 1    2     8 Type1
 2    1     9 Type1
 2    2    10 Type1
 1    1    13 Type2
 1    2    15 Type2
 2    1    17 Type2
 2    2    19 Type2
I want it to look more like this:
(example df 2)
ID Type1.1 Type1.2 Type2.1 Type2.2
 1       7       8      13      15
 2       9      10      17      19
I use:
library(dplyr)
library(tidyr)
df.new <- df %>%
  spread(Type, Value)
and get this from example df 1 which is on the right track:
(example df 3)
ID Time Type1 Type2
 1    1     7    13
 1    2     8    15
 2    1     9    17
 2    2    10    19
But now I want to spread the time for each type. When I do something like this on example df3:
newer.df <- df.new %>%
  spread(Time, Type1)
I get this:
ID Type1.1 Type1.2
 1       7      NA
 1      NA       8
 2       9      NA
 2      NA      10
So it's producing an NA for each row -- is there a way I can collapse the rows onto each other by ID? I think I'm missing something.
Remember, in my example code I'm only using 2 types but in reality I have 5 types -- just wanted to give simplified code.
We can use dcast() from the reshape2 package.
library(reshape2)
dcast(df, ID ~ Type + Time, value.var = "Value")
#  ID Type1_1 Type1_2 Type2_1 Type2_2
#1  1       7       8      13      15
#2  2       9      10      17      19
Or using the original tidyr package, we could do this:
library(tidyr)
df$Type <- paste(df$Type, df$Time, sep="_")
df$Time <- NULL
spread(df, key=Type, value=Value)
  ID Type1_1 Type1_2 Type2_1 Type2_2
1  1       7       8      13      15
2  2       9      10      17      19
Nulling the Time column did the trick for me. It seems that spread treats all columns not otherwise used as what dcast would call id.vars. There might be a more elegant solution using tidyr, though.
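Since tidyr 1.0.0 there is indeed a more direct route: pivot_wider() accepts several names_from columns and builds the combined column names itself, so the manual paste/NULL step is unnecessary. A sketch, assuming tidyr >= 1.0.0:

library(tidyr)
# starting from the original long df (before the paste/NULL steps above);
# names_sep = "_" reproduces the Type1_1 style names
pivot_wider(df, names_from = c(Type, Time), values_from = Value, names_sep = "_")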

Take the subsets of a data.frame with the same feature and select a single row from each subset

Suppose I have a matrix in R as follows:
ID Value
 1    10
 2     5
 2     8
 3    15
 4     7
 4     9
...
What I need is a random sample where every element is represented once and only once.
That means that ID 1 will be chosen, one of the two rows with ID 2, ID 3 will be chosen, one of the two rows with ID 4, etc...
There can be more than two duplicates.
I'm trying to figure out the most R-esque way to do this, without subsetting and then sampling each subset.
Thanks!
tapply across the rownames and grab a sample of 1 in each ID group:
dat[tapply(rownames(dat), dat$ID, FUN = sample, 1), ]
#  ID Value
#1  1    10
#3  2     8
#4  3    15
#6  4     9
If your data is truly a matrix and not a data.frame, you can work around this too, with:
dat[tapply(as.character(seq(nrow(dat))), dat$ID, FUN = sample, 1), ]
Don't be tempted to remove the as.character, as sample will give unintended results when there is only one value passed to it. E.g.
replicate(10, sample(4,1) )
#[1] 1 1 4 2 1 2 2 2 3 4
You can do that with dplyr like so:
library(dplyr)
df %>% group_by(ID) %>% sample_n(1)
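A side note: in dplyr >= 1.0.0, sample_n() is superseded by slice_sample(); assuming a recent dplyr, the equivalent would be:

library(dplyr)
# one random row per ID group; slice_sample() replaces sample_n()
df %>% group_by(ID) %>% slice_sample(n = 1)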
The idea is to reorder the rows randomly and then remove duplicates in that order.
df <- read.table(text="ID Value
1 10
2 5
2 8
3 15
4 7
4 9", header=TRUE)
df2 <- df[sample(nrow(df)), ]
df2[!duplicated(df2$ID), ]

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value, but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like a group with the values of column a from rows 3 & 2. Sorry, but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
    a
1   1
2   2
3   3
4   4
5   5
6   6
7   7
8   8
9   9
10 10
> #PROCESS
> df
    a     b
1   1    NA
2   2   (1)
3   3 (1,2)
4   4 (2,3)
5   5 (3,4)
6   6 (4,5)
7   7 (5,6)
8   8 (6,7)
9   9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result will have to live in two columns, as columns in data frames only support simple data types. (Unless you want to resort to complex numbers.) head with a negative argument cuts the negated value of the argument from the tail, try head(1:10, -2). rep is repetition, c is concatenation. The <- assignment adds a new column if it's not there yet.
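To see the pieces in isolation (outputs are what base R produces for these calls):

head(1:10, -2)                  # drop the last two values
#> [1] 1 2 3 4 5 6 7 8
c(rep(NA, 2), head(1:10, -2))   # shift right by two, padding the front with NA
#> [1] NA NA  1  2  3  4  5  6  7  8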
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
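If the end goal is just to sum or mean the offset values, a hedged alternative is dplyr's lag() (an assumption about intent; note this is dplyr::lag, not base R's stats::lag):

library(dplyr)
df <- data.frame(a = 1:10)
# row-wise sum of the two preceding values of a; NA where the window is incomplete
df$b_sum <- lag(df$a, 1) + lag(df$a, 2)
df$b_sum
#> [1] NA NA  3  5  7  9 11 13 15 17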
Do you want something like this?
> df <- data.frame(a=1:10)
> b = t(sapply(1:10, function(i) c(df$a[(i+2) %% 10 + 1], df$a[(i+4) %% 10 + 1])))
> s = sapply(1:10, function(i) sum(b[i, ]))
> df = data.frame(df, b, s)
> df
    a X1 X2  s
1   1  4  6 10
2   2  5  7 12
3   3  6  8 14
4   4  7  9 16
5   5  8 10 18
6   6  9  1 10
7   7 10  2 12
8   8  1  3  4
9   9  2  4  6
10 10  3  5  8
