Sum every n columns in a data frame in R

I have a data frame (A) with 10 columns and 300 rows. I need to sum each pair of adjacent columns, in this way:
A[,1]+A[,2] = # first result
A[,3]+A[,4] = # second result
A[,5]+A[,6]= # third result
....
A[,9]+A[,10] # last result
The expected final result is a new data frame with 5 columns and 300 rows.
Is there any way to do this, with tapply or a for loop? I know I can follow the pattern above, but I'm looking for a fast method.
Thank you

We could use sapply:
df <- data.frame(replicate(n = 10, expr = rnorm(100)))  # 100 x 10 example data
sapply(seq(1, 9, by = 2), function(i) rowSums(df[, i:(i + 1)]))

You can do it without *apply loops.
Sample data:
df <- head(iris[-5])
df
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#1 5.1 3.5 1.4 0.2
#2 4.9 3.0 1.4 0.2
#3 4.7 3.2 1.3 0.2
#4 4.6 3.1 1.5 0.2
#5 5.0 3.6 1.4 0.2
#6 5.4 3.9 1.7 0.4
Now you can use vector recycling of a logical index:
df[c(TRUE,FALSE)] + df[c(FALSE,TRUE)]
# Sepal.Length Petal.Length
#1 8.6 1.6
#2 7.9 1.6
#3 7.9 1.5
#4 7.7 1.7
#5 8.6 1.6
#6 9.3 2.1

It's a bit cryptic, but it should be fast. We add each column to the adjacent column, then drop the unwanted results by subsetting with c(TRUE, FALSE), which recycles to keep the odd columns:
(A[1:(ncol(A)-1)] + A[2:ncol(A)])[c(TRUE, FALSE)]
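To generalize to the "every n columns" of the title (rather than pairs only), one sketch groups column positions with split.default; the n = 2 case reproduces the pairwise sums above. The toy data frame here is made up for illustration:

```r
# Sketch: sum every n adjacent columns by grouping column positions.
# Assumes n divides ncol(A); here n = 2 as in the question.
A <- data.frame(matrix(1:20, nrow = 2))  # toy 2 x 10 data frame
n <- 2
groups <- ceiling(seq_along(A) / n)      # 1,1,2,2,...,5,5
res <- sapply(split.default(A, groups), rowSums)
res  # 2 x 5 matrix of the pairwise row sums
```

Unlike the recycling trick, this works unchanged for n = 3, 4, ... by editing one variable.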

Multiply value with specific columns

I want to multiply a value (0.045) with specific columns (that start with "i") in a dataset. There is also a column called "id" that has the value 0.045 in all rows.
I've tried this, which did not work:
df %>%
mutate(across(starts_with("i")), ~.id)
The columns to be multiplied can be specified based on position, or based on the fact that they all start with "i".
Hope someone can help me.
Thanks a lot!
Magnus
Try this. I used the iris dataset to create the example. Be careful: the function for mutating the columns should be inside across(), not outside it as in your shared code. Here is the solution:
library(tidyverse)
#Code
iris %>%
  mutate(across(starts_with("Sepal"), ~ .x * 0.045))
Output (some rows):
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 0.2295 0.1575 1.4 0.2 setosa
2 0.2205 0.1350 1.4 0.2 setosa
3 0.2115 0.1440 1.3 0.2 setosa
4 0.2070 0.1395 1.5 0.2 setosa
5 0.2250 0.1620 1.4 0.2 setosa
6 0.2430 0.1755 1.7 0.4 setosa
7 0.2070 0.1530 1.4 0.3 setosa
8 0.2250 0.1530 1.5 0.2 setosa
9 0.1980 0.1305 1.4 0.2 setosa
Base R solution:
cols_bool <- startsWith(names(iris), "Sepal")
cbind(iris[,!cols_bool, drop = FALSE], iris[,cols_bool, drop = FALSE] * 0.045)
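If the intent was to multiply by the value stored in the id column itself (the question mentions it holds 0.045 in every row) rather than the hard-coded literal, here is a base R sketch under that assumption, using a made-up data frame:

```r
# Hypothetical data frame: the multiplier lives in "id", targets start with "i"
df <- data.frame(id = 0.045, i1 = 1:3, i2 = 4:6)
cols <- setdiff(grep("^i", names(df), value = TRUE), "id")  # exclude "id" itself
df[cols] <- df[cols] * df$id  # recycles id down each target column
df
```

Note that "id" also starts with "i", so it must be excluded explicitly from the selection.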

How to tidily create multiple columns from sets of columns?

I'm looking to use a non-across function from mutate to create multiple columns. My problem is that the variable in the function will change along with the crossed variables. Here's an example:
needs=c('Sepal.Length','Petal.Length')
iris %>% mutate_at(needs, ~./'{col}.Width')
This obviously doesn't work, but I'm looking to divide Sepal.Length by Sepal.Width and Petal.Length by Petal.Width.
I think your needs should be something common to both columns.
You can select the columns based on the pattern in needs and divide the data based on position. !! and := are used to assign the names of the new columns.
library(dplyr)
library(rlang)
needs = c('Sepal','Petal')
purrr::map_dfc(needs, ~ iris %>%
                 select(matches(.x)) %>%
                 transmute(!!paste0(.x, '_divide') := .[[1]] / .[[2]]))
# Sepal_divide Petal_divide
#1 1.457142857 7.000000000
#2 1.633333333 7.000000000
#3 1.468750000 6.500000000
#4 1.483870968 7.500000000
#...
#...
If you want to add these as new columns you can do bind_cols the above with iris.
Here is a base R approach, based on the assumption that the columns you want to divide share a common name pattern:
res <- sapply(split.default(iris[-ncol(iris)],
                            sub('\\..*', '', names(iris[-ncol(iris)]))),
              function(i) i[1] / i[2])
iris[names(res)] <- res
head(iris)
# Sepal.Length Sepal.Width Petal.Length Petal.Width Species Petal.Petal.Length Sepal.Sepal.Length
#1 5.1 3.5 1.4 0.2 setosa 7.00 1.457143
#2 4.9 3.0 1.4 0.2 setosa 7.00 1.633333
#3 4.7 3.2 1.3 0.2 setosa 6.50 1.468750
#4 4.6 3.1 1.5 0.2 setosa 7.50 1.483871
#5 5.0 3.6 1.4 0.2 setosa 7.00 1.388889
#6 5.4 3.9 1.7 0.4 setosa 4.25 1.384615
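A plainer base R sketch of the same Length/Width division is a loop over the shared name stems; this assumes the iris-style `<stem>.Length` / `<stem>.Width` naming from the question:

```r
# Divide <stem>.Length by <stem>.Width for each stem, adding ratio columns
dat <- head(iris)
stems <- c("Sepal", "Petal")
for (s in stems) {
  dat[[paste0(s, "_divide")]] <- dat[[paste0(s, ".Length")]] / dat[[paste0(s, ".Width")]]
}
dat
```

Less compact than split.default, but the new column names are fully under your control.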

Smart spreadsheet parsing (managing group sub-header and sum rows, etc)

Say you have a set of spreadsheets formatted like so:
Is there an established method/library to parse this into R without having to individually edit the source spreadsheets? The aim is to parse header rows and dispense with sum rows so the output is the raw data, like so:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 7.0 3.2 4.7 1.4 versicolor
5 6.4 3.2 4.5 1.5 versicolor
6 6.9 3.1 4.9 1.5 versicolor
7 5.7 2.8 4.1 1.3 versicolor
8 6.3 3.3 6.0 2.5 virginica
9 5.8 2.7 5.1 1.9 virginica
10 7.1 3.0 5.9 2.1 virginica
I can certainly hack a tailored solution to this, but I'm wondering if there is something a bit more developed/elegant than read.csv and a load of logic.
Here's a reproducible demo csv dataset (we can't assume an equal number of lines per group), although I'm hoping the solution can transpose to *.xlsx:
,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17
There are a variety of ways to present spreadsheets, so it would be hard to have a consistent methodology for all presentations. However, it is possible to transform the data once it is loaded into R. Here's an example with your data. It uses the function na.locf from package zoo.
x <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
library(zoo)
x <- x[x$X!="Mean",] #remove Mean line
x$Species <- x$X #create species column
x$Species[grepl("[0-9]",x$Species)] <- NA #put NA if Species contains numbers
x$Species <- na.locf(x$Species) #carry last observation if NA
x <- x[!rowSums(is.na(x))>0,] #remove lines with NA
X Sepal.Length Sepal.Width Petal.Length Petal.Width Species
3 1 5.1 3.5 1.4 0.2 Setosa
4 2 4.9 3.0 1.4 0.2 Setosa
5 3 4.7 3.2 1.3 0.2 Setosa
9 1 7.0 3.2 4.7 1.4 Versicolor
10 2 6.4 3.2 4.5 1.5 Versicolor
11 3 6.9 3.1 4.9 1.5 Versicolor
15 1 6.3 3.3 6.0 2.5 Virginica
16 2 5.8 2.7 5.1 1.9 Virginica
17 3 7.1 3.0 5.9 2.1 Virginica
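If pulling in zoo just for na.locf is undesirable, a small base R stand-in (a sketch of the same last-observation-carried-forward behaviour, tolerating leading NAs) could look like:

```r
# Base R last-observation-carried-forward, a sketch equivalent to zoo::na.locf
locf <- function(v) {
  idx <- cumsum(!is.na(v))            # running count of non-NA values seen
  c(NA, v[!is.na(v)])[idx + 1]        # leading NA covers gaps before first value
}
locf(c("a", NA, NA, "b", NA))  # "a" "a" "a" "b" "b"
```

It drops attributes (unlike the zoo version), which is fine for the Species fill here.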
I just recently did something similar. Here was my solution:
iris <- read.csv(text=",Sepal.Length,Sepal.Width,Petal.Length,Petal.Width
,,,,
Setosa,,,,
1,5.1,3.5,1.4,0.2
2,4.9,3,1.4,0.2
3,4.7,3.2,1.3,0.2
Mean,4.9,3.23,1.37,0.2
,,,,
Versicolor,,,,
1,7,3.2,4.7,1.4
2,6.4,3.2,4.5,1.5
3,6.9,3.1,4.9,1.5
Mean,6.77,3.17,4.7,1.47
,,,,
Virginica,,,,
1,6.3,3.3,6,2.5
2,5.8,2.7,5.1,1.9
3,7.1,3,5.9,2.1
Mean,6.4,3,5.67,2.17", header=TRUE, stringsAsFactors=FALSE)
First I used a helper function which splits a data frame at the given row indices.
split_at <- function(x, index) {
  N <- NROW(x)
  s <- cumsum(seq_len(N) %in% index)
  unname(split(x, s))
}
Then you define that index using:
iris[,1] <- stringr::str_trim(iris[,1])
index <- which(iris[,1] %in% c("Virginica", "Versicolor", "Setosa"))
The rest is just using purrr::map_df to perform actions on each data.frame in the list that's returned. You can add some additional flexibility for removing unwanted rows if needed.
split_at(iris, index) %>%
  .[2:length(.)] %>%
  purrr::map_df(function(x) {
    Species <- x[1, 1]
    x <- x[-c(1, NROW(x) - 1, NROW(x)), ]
    data.frame(x, Species = Species)
  })

Use variable in i of data.table subset

I am new to data.table and think that this is an easy question, but can't seem to find the answer anywhere.
I want to subset a table based on the value of two columns, whose names I know. But I want to compare against a value which I don't know in advance. That is, I want to use a variable for the i portion of DT[]. But I can't seem to figure out how to do it. Everything I see explains how to use a variable for j (i.e. column names), but not for i.
When I just put the name of the variable in, i.e.
setkey(dtpredictions, colA, colB)
nextweek = dtpredictions[J(uservar, weekvar)]
it returns the entire table. Trying to apply the answer to FAQ 1.6, I tried:
nextweek = dtpredictions[J(eval(quote(uservar)), eval(quote(weekvar)))]
and
nextweek = dtpredictions[J(eval(user), eval(week))]
but both still returned the whole table.
I am pretty sure this is very simple, but I am stuck.
EDIT
I apologize for not clarifying earlier: I would like to do a binary search, since I need the speedup. I know that I can do a vector scan using ==, but I would prefer not to.
Found the problem - one of my variables had the same name as a column in the table. I actually saw a question about a similar problem here, but didn't even realize that I had that issue. (It was another column in the table, not the one I was subsetting on.)
I changed the name of the variable I was using to subset and now it works.
hmmm...interesting. Does this code seem to work for you? I am not getting the same error. I am using data.table 1.9.3.
require(data.table)
iris <- data.table(iris)
#Create new categorical variable
set.seed(1)
iris[ , new.var := sample(letters[1:5],150,replace=TRUE)]
#Set keys
setkey(iris,Species,new.var)
#Create variables to reference
check1 <- "setosa"
check2 <- "b"
#Return matches
iris[J(check1,check2)]
And the resulting table:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species new.var
1: 5.1 3.5 1.4 0.2 setosa b
2: 4.9 3.0 1.4 0.2 setosa b
3: 5.0 3.6 1.4 0.2 setosa b
4: 5.4 3.7 1.5 0.2 setosa b
5: 4.3 3.0 1.1 0.1 setosa b
6: 5.7 3.8 1.7 0.3 setosa b
7: 5.1 3.7 1.5 0.4 setosa b
8: 4.8 3.4 1.9 0.2 setosa b
9: 5.0 3.0 1.6 0.2 setosa b
10: 5.2 3.5 1.5 0.2 setosa b
11: 4.7 3.2 1.6 0.2 setosa b
Is this what you are looking for?
setkey(dtpredictions, colA, colB)
nextweek <- dtpredictions[colA == uservar & colB == weekvar]
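On newer data.table versions you can also sidestep the variable/column name collision entirely (and still get a join-based, binary-search subset) with an ad hoc `on=` join instead of setkey; a sketch with made-up data, assuming data.table >= 1.9.6:

```r
library(data.table)
dt <- data.table(colA = c("u1", "u1", "u2"), colB = c(1, 2, 1), val = 1:3)
uservar <- "u1"
weekvar <- 2
# .() builds the one-row lookup table; on= names the join columns explicitly,
# so there is no ambiguity between variables and columns, and no setkey() needed
dt[.(uservar, weekvar), on = c("colA", "colB")]
```

Because the join columns are named on the right-hand side of `on=`, the variables on the left can never be shadowed by columns of the table.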

Defining functions (in rollapply) using lines of a dataframe

First of all, I have a data frame (let's call it "years") with 5 rows and 10 columns. I need to build a new one computing (x1-x2)/x1, where x1 is the first element and x2 the second element of a column in "years", then (x2-x3)/x2, and so forth. I thought rollapply would be the best tool for the task, but I can't figure out how to define such a function to insert into rollapply.
I'm new to R, so I hope my question is not too basic. Anyway, I couldn't find a similar question here so I'd be really thankful if someone could help me.
You can use transform, diff and length; there is no need for rollapply:
> df <- head(iris,5) # some data
> transform(df, New = c(NA, diff(Sepal.Length)/Sepal.Length[-length(Sepal.Length)] ))
Sepal.Length Sepal.Width Petal.Length Petal.Width Species New
1 5.1 3.5 1.4 0.2 setosa NA
2 4.9 3.0 1.4 0.2 setosa -0.03921569
3 4.7 3.2 1.3 0.2 setosa -0.04081633
4 4.6 3.1 1.5 0.2 setosa -0.02127660
5 5.0 3.6 1.4 0.2 setosa 0.08695652
diff.zoo in the zoo package with the arithmetic=FALSE argument will divide each number by the prior in each column:
library(zoo)
as.data.frame(1 - diff(zoo(DF), arithmetic = FALSE))
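The zoo one-liner matches the question's (x1-x2)/x1 orientation; the same computation can be sketched in plain base R with row-shifted subsets, no packages, using a made-up "years" data frame:

```r
# (x_i - x_{i+1}) / x_i for every column, via row-shifted copies
years <- data.frame(a = c(10, 8, 4), b = c(2, 1, 5))
top <- years[-nrow(years), ]   # x1 .. x_{n-1}
bot <- years[-1, ]             # x2 .. x_n
res <- (top - bot) / top       # element-wise over all columns at once
res
```

A 5 x 10 input thus yields a 4 x 10 result, one fewer row than the original, as each entry pairs a row with its successor.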
