R: summing certain columns in each row

I'm having an issue, but I'm sure it's super easy for someone who is very familiar with R.
I have a matrix that is 3008 x 3008. What I want is to sum every 8 columns in each row, so I'd end up with a new matrix that is 3008 x 376 (3008 / 8 = 376 column groups).
Here's a small example:
     C.1 C.2 C.3 C.4 C.5 C.6
row1   1   2   1   2   5   6
row1   1   2   3   4   5   6
row1   2   6   3   4   5   6
row1   1   2   3   4  10   6
So say I wanted to sum these for every 3 columns in each row, I'd want to end up with:
     C.1 C.2
row1   4  13
row1   6  15
row1  11  15
row1   6  20

# m is your matrix
n <- 8
grp <- seq(1, ncol(m), by=n)
sapply(grp, function(x) rowSums(m[, x:(x+n-1)]))
Some explanation if you're new to R: grp is a sequence of numbers that gives the starting column of each group: 1, 9, 17, etc., if you want to sum every 8 columns.
The sapply call can be understood as follows. For each number in grp, it calls the rowSums function, passing it the matrix columns that belong to that group. Thus when the starting column is 1, it gets the row sums for columns 1-8; when it is 9, it gets the row sums for columns 9-16, and so on. These are vectors, which sapply then binds together into a matrix. Note that this assumes the number of columns is a multiple of n; otherwise the last group indexes past the end of the matrix and R reports a "subscript out of bounds" error.
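If the number of columns is not guaranteed to be a multiple of n, one possible variation is to label each column with a group index and let rowsum() do the grouping. This is only a sketch, not part of the answer above: the small matrix is the example from the question, and grp_id and res are names chosen here for illustration.

```r
# Small example matrix from the question (4 rows, 6 columns)
m <- matrix(c(1, 2, 1, 2, 5, 6,
              1, 2, 3, 4, 5, 6,
              2, 6, 3, 4, 5, 6,
              1, 2, 3, 4, 10, 6), nrow = 4, byrow = TRUE)
n <- 3

# Group index for each column: 1 1 1 2 2 2.
# A trailing partial group would simply get its own index.
grp_id <- ceiling(seq_len(ncol(m)) / n)

# rowsum() sums rows by group, so transpose, sum, and transpose back
res <- t(rowsum(t(m), grp_id))
res
```

With a 7-column matrix and n = 3, the same code would return three columns, the last one holding just the leftover column.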

Transform your matrix to an array, then use apply and rowSums.
mat <- structure(c(1L, 1L, 2L, 1L, 2L, 2L, 6L, 2L, 1L, 3L, 3L, 3L, 2L, 4L, 4L, 4L, 5L, 5L, 5L, 10L, 6L, 6L, 6L, 6L),
.Dim = c(4L, 6L),
.Dimnames = list(c("row1", "row2", "row3", "row4"), c("C.1", "C.2", "C.3", "C.4", "C.5", "C.6")))
n <- 3 #this needs to be a factor of the number of columns
a <- array(mat,dim=c(nrow(mat),n,ncol(mat)/n))
apply(a,3,rowSums)
# [,1] [,2]
# [1,] 4 13
# [2,] 6 15
# [3,] 11 15
# [4,] 6 20

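# Create sample data (a numeric matrix with 20 columns):
df <- matrix(rexp(200, rate=.1), ncol=20)
# Choose the number of columns you'd like to sum up (e.g., 3 or 8)
number_of_columns_to_sum <- 3
df2 <- NULL # Set to NULL so that you can use cbind on the first value below
for (i in seq(1, ncol(df), by = number_of_columns_to_sum)) {
  # Clamp the last group to ncol(df), since 20 is not a multiple of 3
  last <- min(i + number_of_columns_to_sum - 1, ncol(df))
  # drop = FALSE keeps a matrix even if the last group has a single column
  df2 <- cbind(df2, rowSums(df[, i:last, drop = FALSE]))
}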

Another option, though it may not be as elegant:
mat <- structure(c(1L, 1L, 2L, 1L, 2L, 2L, 6L, 2L, 1L, 3L, 3L, 3L, 2L, 4L, 4L, 4L, 5L, 5L, 5L, 10L, 6L, 6L, 6L, 6L),
.Dim = c(4L, 6L),
.Dimnames = list(c("row1", "row1", "row1", "row1"), c("C.1", "C.2", "C.3", "C.4", "C.5", "C.6")))
new<- data.frame((mat[,1]+mat[,2]+mat[,3]),(mat[,4]+mat[,5]+mat[,6]))
names(new)<- c("C.1","C.2")
new

Related

How to make "For loop" based on column

I have been using for loops before this, but the loop variable is usually k, which refers to row numbers.
Example:
for (k in 1:n) {
expression
}
My question is, is it possible for the variable to be a certain column?
Example:
for ("column no" in 1:n) {
expression
}
I have had several trials and errors and a bit stuck now. Here is my data:
date mold no
22-May 1.35436 1
23-May 0.88592 1
24-May 0.81316 1
25-May 0.80856 1
26-May 0.84646 1
27-May 0.81762 1
28-May 0.79828 1
03-Jan 1.09158 2
04-Jan 0.86661 2
05-Jan 0.81908 2
06-Jan 0.7555 2
07-Jan 0.66577 2
08-Jan 0.66706 2
09-Jan 0.67133 2
05-Feb 20.4366 3
06-Feb 5.77923 3
06-Feb 3.12323 3
05-Feb 2.25436 3
06-Feb 1.74551 3
06-Feb 1.52744 3
05-Feb 1.45483 3
28-Jul 1.55148 4
29-Jul 1.18882 4
30-Jul 1.10595 4
31-Jul 1.14101 4
01-Aug 1.1453 4
02-Aug 1.10113 4
03-Aug 1.09152 4
30-Nov 8.3254 5
01-Dec 4.03003 5
02-Dec 2.18026 5
03-Dec 1.40028 5
04-Dec 1.02901 5
05-Dec 0.85859 5
06-Dec 0.7776 5
I would like to ask R to sum up the values in the mold column for each group (1 to 5) in the no column. For example, for no = 1, it would be
1.35436 + 0.88592 + 0.81316 + 0.80856 + 0.84646 + 0.81762 + 0.79828 = 6.32436
Then repeat for no = 2, 3, 4 etc.
We can loop through the unique elements of 'no', compare each one (==) against the column, and get the sum of the 'mold' elements that correspond to the resulting logical vector:
un1 <- unique(df1$no)
v1 <- numeric(length(un1))
for(i in seq_along(v1)) v1[i] <- sum(df1$mold[df1$no== un1[i]])
v1
#[1] 6.32436 5.53693 36.32120 8.32521 18.60117
This gives the same result as rowsum:
rowsum(df1$mold, df1$no)[,1]
# 1 2 3 4 5
# 6.32436 5.53693 36.32120 8.32521 18.60117
data
df1 <- structure(list(date = c("22-May", "23-May", "24-May", "25-May",
"26-May", "27-May", "28-May", "03-Jan", "04-Jan", "05-Jan", "06-Jan",
"07-Jan", "08-Jan", "09-Jan", "05-Feb", "06-Feb", "06-Feb", "05-Feb",
"06-Feb", "06-Feb", "05-Feb", "28-Jul", "29-Jul", "30-Jul", "31-Jul",
"01-Aug", "02-Aug", "03-Aug", "30-Nov", "01-Dec", "02-Dec", "03-Dec",
"04-Dec", "05-Dec", "06-Dec"), mold = c(1.35436, 0.88592, 0.81316,
0.80856, 0.84646, 0.81762, 0.79828, 1.09158, 0.86661, 0.81908,
0.7555, 0.66577, 0.66706, 0.67133, 20.4366, 5.77923, 3.12323,
2.25436, 1.74551, 1.52744, 1.45483, 1.55148, 1.18882, 1.10595,
1.14101, 1.1453, 1.10113, 1.09152, 8.3254, 4.03003, 2.18026,
1.40028, 1.02901, 0.85859, 0.7776), no = c(1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L, 5L, 5L, 5L, 5L, 5L, 5L)),
class = "data.frame", row.names = c(NA,
-35L))
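For comparison, the same grouped sum can be written without an explicit loop using tapply() or aggregate(). This is a minimal sketch, assuming a data frame shaped like df1. To keep it self-contained, only the rows for no = 1 and no = 2 are reproduced here, under the hypothetical name small.

```r
# Rows for no = 1 and no = 2 from the question's data
small <- data.frame(
  mold = c(1.35436, 0.88592, 0.81316, 0.80856, 0.84646, 0.81762, 0.79828,
           1.09158, 0.86661, 0.81908, 0.7555, 0.66577, 0.66706, 0.67133),
  no   = rep(1:2, each = 7)
)

# Named vector: one sum per value of 'no'
sums_vec <- tapply(small$mold, small$no, sum)

# Data frame with columns 'no' and 'mold'
sums_df <- aggregate(mold ~ no, data = small, FUN = sum)

sums_vec
sums_df
```

tapply() is handy when a named vector is enough; aggregate() keeps the grouping column next to the sums, which is easier to merge back into other data.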

Need help replacing values with NA when another condition is met in R (i.e. when another variable is a specific value)

I'm trying to delete some repeating information in my data set and replace it with NA. Here's an example of the data:
DataTable1
ID Day x y
1 1 1 3
1 2 1 3
2 1 2 5
2 2 2 5
3 1 3 4
3 2 3 4
4 1 4 6
4 2 4 6
I'm trying to replace "x" and "y" values with "NA" when Day=1. This is what I want:
ID Day x y
1 1 NA NA
1 2 1 3
2 1 NA NA
2 2 2 5
3 1 NA NA
3 2 3 4
4 1 NA NA
4 2 4 6
I'm not really sure where to start or how to go about this. I tried using the replace_with_na_if function from the naniar library. Otherwise, I am unsure what to try.
replace_with_na_if(data.frame=DataTable1$x,
condition=DataTable1$Day== 2)
I received an error message that reads:
Error in replace_with_na_if(data.frame = DataTable1$x, condition = DataTable1$Day == :
unused argument (data.frame = DataTable1$x)
One option in base R is to create a logical vector based on the elements of 'Day', then use that index to subset the 'x' and 'y' columns and assign NA to them (df1 here stands for your data frame, DataTable1 in the question):
i1 <- df1$Day == 1
df1[i1, c('x', 'y')] <- NA
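Applied to the table from the question, the two lines above can be tried as follows. This sketch rebuilds the example data by hand under the name df1; the answer does not show how df1 was created, so that part is an assumption.

```r
# Rebuild the question's example table (assumed construction)
df1 <- data.frame(ID  = rep(1:4, each = 2),
                  Day = rep(1:2, times = 4),
                  x   = rep(1:4, each = 2),
                  y   = rep(c(3L, 5L, 4L, 6L), each = 2))

# TRUE for the rows where Day is 1
i1 <- df1$Day == 1

# Blank out x and y on those rows only
df1[i1, c("x", "y")] <- NA
df1
```

The rows with Day equal to 2 are left untouched, and the ID and Day columns are never modified.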
Here's a data.table solution. Since you may be new to R, note that you need to install the data.table package first. If you have a large data set, data.table may be faster than a plain data frame. I also find the syntax easy to read and understand.
#Create the data frame:
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
library(data.table)
dt <- setDT(df) # convert the data frame to a data.table
dt[Day == 1, c("x","y") := NA] # where Day equals 1, make the columns x and y equal NA
Good luck and welcome to stackoverflow!
With dplyr, we can use mutate_at and replace:
library(dplyr)
df %>% mutate_at(vars(x, y), ~replace(., Day == 1, NA))
# ID Day x y
#1 1 1 NA NA
#2 1 2 1 3
#3 2 1 NA NA
#4 2 2 2 5
#5 3 1 NA NA
#6 3 2 3 4
#7 4 1 NA NA
#8 4 2 4 6
data
df <- structure(list(ID = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), Day = c(1L, 2L, 1L,
2L, 1L, 2L, 1L, 2L), x = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L), y = c(3L, 3L, 5L, 5L,
4L, 4L, 6L, 6L)), class = "data.frame", row.names = c(NA, -8L))
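In dplyr 1.0.0 and later, mutate_at is superseded by across(). A sketch of the equivalent call, assuming a recent dplyr is installed; the result name out is chosen here for illustration.

```r
library(dplyr)

# Same example data as above
df <- data.frame(ID  = rep(1:4, each = 2),
                 Day = rep(1:2, times = 4),
                 x   = rep(1:4, each = 2),
                 y   = rep(c(3L, 5L, 4L, 6L), each = 2))

# For columns x and y, replace the values on rows where Day == 1 with NA
out <- df %>%
  mutate(across(c(x, y), ~ replace(.x, Day == 1, NA)))
out
```

Inside the lambda, .x is the column being transformed, while Day is looked up in the data frame itself.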

R: iterate over columns and plot

My data looks like this:
Group Feature_A Feature_B Feature_C Feature_D
1 1 0 3 2 4
2 1 5 2 2 8
3 1 9 8 6 5
4 2 5 7 8 8
5 2 2 6 8 1
6 2 3 8 6 4
7 3 1 5 3 5
8 3 1 4 3 4
df <- structure(list(Group = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L), Feature_A = c(0L,
5L, 9L, 5L, 2L, 3L, 1L, 1L), Feature_B = c(3L, 2L, 8L, 7L, 6L,
8L, 5L, 4L), Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L), Feature_D = c(4L,
8L, 5L, 8L, 1L, 4L, 5L, 4L)), .Names = c("Group", "Feature_A",
"Feature_B", "Feature_C", "Feature_D"), class = "data.frame", row.names = c(NA,
-8L))
For every Feature I want to generate a plot (e.g., boxplot) that would highlight the difference between Groups.
# Get unique Feature and Group
Features<-unique(colnames(df[,-1]))
Group<-unique(df$Group)
But how can I do the rest?
Pseudo-code might look like this:
Select Feature from Data
Split Data according Group
Boxplot
for (i in 1:levels(df$Features)){
for (o in 1:length(Group)){
}}
How can I achieve this? Hope someone can help me.
I would put my data in the long format. Then, using ggplot2, you can do some nice things.
library(reshape2)
library(ggplot2)
library(gridExtra)
## long format using Group as id (df is the data frame from the question)
dat.m <- melt(df,id='Group')
## bar plot
p1 <- ggplot(dat.m) +
geom_bar(aes(x=Group,y=value,fill=variable),stat='identity')
## box plot
p2 <- ggplot(dat.m) +
geom_boxplot(aes(x=factor(Group),y=value,fill=variable))
## aggregate the 2 plots
grid.arrange(p1,p2)
This is easy to do. I do this all the time.
The code below generates the charts using ggplot and saves them as objects named ch_Feature_A ....
You can wrap the loop in pdf() and dev.off() calls to send the charts to a PDF as well (call print(ch) inside the loop so that each chart is actually drawn).
library(ggplot2)
df$Group <- as.factor(df$Group)
for (i in 2:dim(df)[2]) {
ch <- ggplot(df,aes_string(x="Group",y=names(df)[i],fill="Group"))+geom_boxplot()
assign(paste0("ch_",names(df)[i]),ch)
}
Or even simpler, if you do not want separate charts:
library(reshape2)
# Keep Group as the id column (it is still a factor from the step above)
df1 <- melt(df, id.vars="Group")
ggplot(df1,aes(x=Group,y=value,fill=Group))+geom_boxplot()+facet_grid(.~variable)
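If ggplot2 is not available, a similar loop can be written with base graphics. This is a sketch rather than part of the answers above; it rebuilds the df from the question, and the names features and stats are introduced here for illustration.

```r
# Data from the question
df <- data.frame(
  Group     = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L),
  Feature_A = c(0L, 5L, 9L, 5L, 2L, 3L, 1L, 1L),
  Feature_B = c(3L, 2L, 8L, 7L, 6L, 8L, 5L, 4L),
  Feature_C = c(2L, 2L, 6L, 8L, 8L, 6L, 3L, 3L),
  Feature_D = c(4L, 8L, 5L, 8L, 1L, 4L, 5L, 4L)
)

# Every column except the grouping column is a feature
features <- setdiff(names(df), "Group")

op <- par(mfrow = c(2, 2))  # 2 x 2 grid, one panel per feature
stats <- lapply(features, function(feat) {
  # Build the formula Feature_X ~ Group from the column name
  boxplot(reformulate("Group", response = feat), data = df,
          main = feat, xlab = "Group", ylab = feat)
})
names(stats) <- features
par(op)  # restore the previous graphics settings
```

boxplot() returns its summary statistics invisibly, so stats also holds the per-group numbers behind each panel.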

Checking row format of csv

I am trying to import some data (below) and checking to see if I have the appropriate number of rows for later analysis.
repexample <- structure(list(QueueName = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), .Label = c(" Overall", "CCM4.usci_retention_eng", "usci_helpdesk"
), class = "factor"), X8Tile = structure(c(1L, 2L, 3L, 4L, 5L,
6L, 7L, 8L, 9L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L,
9L), .Label = c(" Average", "1", "2", "3", "4", "5", "6", "7",
"8"), class = "factor"), Actual = c(508.1821504, 334.6994838,
404.9048759, 469.4068667, 489.2800416, 516.5744106, 551.7966176,
601.5103783, 720.9810622, 262.4622533, 250.2777778, 264.8281938,
272.2807882, 535.2466968, 278.25, 409.9285714, 511.6635101, 553,
641, 676.1111111, 778.5517241, 886.3666667), Calls = c(54948L,
6896L, 8831L, 7825L, 5768L, 7943L, 5796L, 8698L, 3191L, 1220L,
360L, 454L, 406L, 248L, 11L, 9L, 94L, 1L, 65L, 9L, 29L, 30L),
Pop = c(41L, 6L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 3L, 1L, 1L,
1L, 11L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L)), .Names = c("QueueName",
"X8Tile", "Actual", "Calls", "Pop"), class = "data.frame", row.names = c(NA,
-22L))
The data gives 5 columns and is one example of some data that I would typically import (via a .csv file). As you can see there are three unique values in the column "QueueName". For each unique value in "QueueName" I want to check that it has 9 rows, or the corresponding values in the column "X8Tile" ( Average, 1, 2, 3, 4, 5, 6, 7, 8). As an example the "QueueName" Overall has all of the necessary rows, but usci_helpdesk does not.
So my first priority is to at least identify if one of the unique values in "QueueName" does not have all of the necessary rows.
My second priority would be to remove all of the rows corresponding to a unique "QueueName" that does not meet the requirements.
Both these priorities are easily addressed using the Split-Apply-Combine paradigm, implemented in the plyr package.
Priority 1: Identify values of QueueName which don't have enough rows
require(plyr)
# Make a short table of the number of rows for each unique value of QueueName
rowSummary <- ddply(repexample, .(QueueName), summarise, numRows=length(QueueName))
print(rowSummary)
If you have lots of unique values of QueueName, you'll want to identify the values which are not equal to 9:
rowSummary[rowSummary$numRows !=9, ]
Priority 2: Eliminate rows for which QueueName does not have enough rows
repexample2 <- ddply(repexample, .(QueueName), transform, numRows=length(QueueName))
repexampleEdit <- repexample2[repexample2$numRows ==9, ]
print(repexampleEdit)
(I don't quite understand the meaning of 'check that it has 9 rows, or the corresponding values in the column "X8Tile"'.) You could edit the repexampleEdit line based on your needs.
This is an approach that makes some assumptions about how your data are ordered. It can be modified (or your data can be reordered) if the assumption doesn't fit:
## Paste together the values from your "X8Tile" column
## If all is in order, you should have "Average12345678"
## If anything is missing, you won't....
myMatch <- names(
which(with(repexample, tapply(X8Tile, QueueName, FUN=function(x)
gsub("^\\s+|\\s+$", "", paste(x, collapse = ""))))
== "Average12345678"))
## Use that to subset...
repexample[repexample$QueueName %in% myMatch, ]
# QueueName X8Tile Actual Calls Pop
# 1 Overall Average 508.1822 54948 41
# 2 Overall 1 334.6995 6896 6
# 3 Overall 2 404.9049 8831 5
# 4 Overall 3 469.4069 7825 5
# 5 Overall 4 489.2800 5768 5
# 6 Overall 5 516.5744 7943 5
# 7 Overall 6 551.7966 5796 5
# 8 Overall 7 601.5104 8698 5
# 9 Overall 8 720.9811 3191 5
# 14 CCM4.usci_retention_eng Average 535.2467 248 11
# 15 CCM4.usci_retention_eng 1 278.2500 11 2
# 16 CCM4.usci_retention_eng 2 409.9286 9 2
# 17 CCM4.usci_retention_eng 3 511.6635 94 2
# 18 CCM4.usci_retention_eng 4 553.0000 1 1
# 19 CCM4.usci_retention_eng 5 641.0000 65 1
# 20 CCM4.usci_retention_eng 6 676.1111 9 1
# 21 CCM4.usci_retention_eng 7 778.5517 29 1
# 22 CCM4.usci_retention_eng 8 886.3667 30 1
Similar approaches can be taken with aggregate+merge and similar tools.
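One such variation, which does not depend on row order, is to compare the set of X8Tile values in each queue against the required set. This is a sketch on hypothetical toy data (queues "A" and "B" are made up here), and required, ok, complete, and d_clean are names introduced for illustration.

```r
# The nine values every queue is expected to have
required <- c("Average", as.character(1:8))

# Hypothetical toy data: queue "A" is complete (in reverse order), "B" is not
d <- data.frame(QueueName = c(rep("A", 9), rep("B", 4)),
                X8Tile    = c(rev(required), required[1:4]))

# TRUE for queues whose X8Tile values are exactly the required set
ok <- tapply(d$X8Tile, d$QueueName, function(v) {
  v <- trimws(as.character(v))  # factor labels may carry stray spaces
  length(v) == length(required) && setequal(v, required)
})

complete <- names(ok)[ok]
d_clean  <- d[d$QueueName %in% complete, ]
complete
```

The length check guards against a queue that has all nine values plus duplicates, which setequal() alone would accept.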

reshape: cast oddity

Either it's late, or I've found a bug, or cast doesn't like colnames with "." in them. This all happens inside a function, but it "doesn't work" outside of a function as much as it doesn't work inside of it.
x <- structure(list(df.q6 = structure(c(1L, 1L, 1L, 11L, 11L, 9L,
4L, 11L, 1L, 1L, 2L, 2L, 11L, 5L, 4L, 9L, 4L, 4L, 1L, 9L, 4L,
10L, 1L, 11L, 9L), .Label = c("a", "b", "c", "d", "e", "f", "g",
"h", "i", "j", "k"), class = "factor"), df.s5 = structure(c(4L,
4L, 1L, 2L, 4L, 4L, 4L, 3L, 4L, 1L, 2L, 1L, 2L, 4L, 1L, 3L, 4L,
2L, 2L, 4L, 4L, 4L, 2L, 2L, 1L), .Label = c("a", "b", "c", "d",
"e"), class = "factor")), .Names = c("df.q6", "df.s5"), row.names = c(NA,
25L), class = "data.frame")
library(reshape)
cast(x, df.q6 + df.s5 ~., length)
No worky.
However, if:
colnames(x) <- c("variable", "value")
cast(x, variable + value ~., length)
Works like a charm.
I use a solution similar to what Spacedman points out.
#take your data.frame x with its two columns
#add a column
x$value <- 1
#apply your cast verbatim
cast(x, df.q6 + df.s5 ~., length)
df.q6 df.s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
6 d a 1
7 d b 1
8 d d 3
9 e d 1
10 i a 1
11 i c 1
12 i d 2
13 j d 1
14 k b 3
15 k c 1
16 k d 1
Hopefully that helps!
Jay
Nothing to do with the dots in the colnames (easily shown!).
If your data frame doesn't have a column called 'value', then cast() guesses which column is the value. In this case it guesses 'df.s5', as it is the last column, which is what you get when you melt() data. It then renames that column to 'value' before calling reshape1. Now the column 'df.s5' is no more, yet it's there on the left of your formula. Uh oh.
You are using the value in the formula, which is an odd thing to do. None of the cast examples do that. What are you trying to do here?
You could add an ad-hoc column as a dummy value:
> cast(cbind(x,1), df.q6+s5~., length)
Using 1 as value column. Use the value argument to cast to override this choice
df.q6 s5 (all)
1 a a 2
2 a b 2
3 a d 3
4 b a 1
5 b b 1
[etc]
But I suspect there's a better way to get the number of repeated observations (rows) in a data frame - which is your real question!
If you are looking for an easy solution, dcast in the reshape2 package can help you:
library(reshape2)
dcast(x, df.q6 + df.s5 ~., length)
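If the underlying goal is just to count how often each combination of df.q6 and df.s5 occurs, base R can do that without cast. This is a sketch using a shortened, hypothetical stand-in for x (six rows instead of 25); the helper column n and the name counts are introduced here.

```r
# Shortened stand-in for the question's data frame (hypothetical rows)
x <- data.frame(df.q6 = c("a", "a", "a", "b", "k", "k"),
                df.s5 = c("a", "a", "d", "b", "b", "b"))

# Helper column: each row counts once
x$n <- 1L

# Sum the helper column within each df.q6 / df.s5 combination
counts <- aggregate(n ~ df.q6 + df.s5, data = x, FUN = sum)
counts
```

Only combinations that actually occur appear in the result, which matches the shape of the cast output shown above.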
