Summing up specific entries in subset groups (R programming)

So basically I have data in this format:
ID Value
1 32
5 231
2 122
1 11
3 ...
2 ...
5 ...
6 ...
2 ...
1 33
. ...
. ...
. ...
I want to sum up the values with ID '1', but in groups of 5 rows.
i.e.
In the first 5 entries, there are 2 entries with ID '1', so I get a sum of 43,
and then in the next 5 entries, only one entry has ID '1', so I get 33,
and so on...
So at the end I want to get an array with all the sums, i.e. (43, 33, ...).
I can do it with a for loop and tapply, but I think there must be a better way in R that doesn't need a for loop.
Any help is much appreciated! Thank you very much!
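For reference, the computation described above can be sketched in base R without an explicit for loop. This assumes a data frame df with columns ID and Value; grp is just an illustrative name for the block index built from the row positions:
grp <- (seq_len(nrow(df)) - 1) %/% 5 + 1   # rows 1-5 -> block 1, rows 6-10 -> block 2, ...
sapply(split(df, grp), function(block) sum(block$Value[block$ID == 1]))   # per-block sum of the ID-1 values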

Make a new column to reflect the groups of 5:
df = data.frame(
  id = sample(1:5, size=98, replace=TRUE),
  value = sample(1:98)
)
# This gets you a vector of 1,1,1,1,1, 2,2,2,2,2, 3, ...
groups = rep(1:(ceiling(nrow(df) / 5)), each=5)
# But it might be longer than the dataframe, so:
df$group = groups[1:nrow(df)]
Then it's pretty easy to get the sums within each group:
library(plyr)
sums = ddply(
  df,
  .(group, id),
  function(df_part) {
    sum(df_part$value)
  }
)
Example output:
> head(df)
id value group
1 4 94 1
2 4 91 1
3 3 22 1
4 5 42 1
5 1 46 1
6 2 38 2
> head(sums)
group id V1
1 1 1 46
2 1 3 22
3 1 4 185
4 1 5 42
5 2 2 55
6 2 3 158
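For reference, base R's aggregate() gives the same grouped sums without plyr (a small sketch reusing the df and group column built above; sums_base is just an illustrative name):
# sum value within each (group, id) combination
sums_base <- aggregate(value ~ group + id, data = df, FUN = sum)
head(sums_base)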

Something like this will do the job:
m <- matrix(d$Value, nrow=5)
# Zero out the values whose ID is not 1
m[which(d$ID != 1)] <- 0
# Fix for data whose length is not a multiple of 5
# (matrix() recycles the vector, so zero out the recycled tail)
if ((length(d$Value) %% 5) != 0)
  m[(length(d$Value) + 1):length(m)] <- 0
# Each column holds one group of 5; the column sums are the answer
colSums(m)

If you add a column to delineate groups, ddply() can work magic:
ID <- c(1, 5, 2, 1, 3, 2, 5, 6, 2, 1)
Value <- c(32, 231, 122, 11, 45, 34, 74, 12, 32, 33)
Group <- rep(seq(100), each=5)[1:length(ID)]
test.data <- data.frame(ID, Value, Group)
library(plyr)
output <- ddply(test.data, .(Group, ID), function(chunk) sum(chunk$Value))
> head(test.data)
ID Value Group
1 1 32 1
2 5 231 1
3 2 122 1
4 1 11 1
5 3 45 1
6 2 34 2
> head(output)
Group ID V1
1     1  1  43
2     1  2 122
3     1  3  45
4     1  5 231
5     2  1  33
6     2  2  66

Drawing multiple barplots on a graph using data with different sizes

I can plot multiple bar plots on one plot with the following code (taken from this question):
mydata <- data.frame(Barplot1=rbinom(5,16,0.6), Barplot2=rbinom(5,16,0.25),
                     Barplot3=rbinom(5,5,0.25), Barplot4=rbinom(5,16,0.7))
barplot(as.matrix(mydata), main="Interesting", ylab="Total", beside=TRUE,
        col=terrain.colors(5))
legend(13, 12, c("Label1","Label2","Label3","Label4","Label5"), cex=0.6,
       fill=terrain.colors(5))
But my scenario is a bit different: I have data stored in 3 data.frames (sorted according to the V2 column), where the V1 column is the Y axis and the V2 column is the X axis:
> tail(hist1)
V1 V2
67 2 70
68 2 72
69 1 73
70 2 74
71 1 76
72 1 84
> tail(hist2)
V1 V2
87 1 92
88 3 94
89 1 95
90 2 96
91 1 104
92 1 112
> tail(hist3)
V1 V2
103 3 110
104 1 111
105 2 112
106 2 118
107 2 120
108 1 138
For plotting one single plot it is as simple as:
barplot(hist3$V1, main="plot title", names.arg = hist3$V2)
But I cannot construct the matrix needed for the plot because of several problems that I can see right now (maybe there are others):
My data frames have different sizes:
> nrow(hist1)
[1] 72
> nrow(hist2)
[1] 92
> nrow(hist3)
[1] 108
There are X (and therefore also Y) values which are in one data.frame but not in another, e.g.:
> hist3$V2[which(hist3$V2==138)]
[1] 138
> hist1$V2[which(hist1$V2==138)]
integer(0)
What I need (I guess) is something that will create the appropriate V2 (x-axis) entries with a 0 Y value in the appropriate data.frame, so they all have the same length and I can combine them as in the example above. See the following example with only 2 data.frames (v2 and v1 are reversed compared to the previous example):
> # missing v2 for 3,4,5
> df1
v2 v1
1 1 1
2 2 2
3 6 3
4 7 4
5 8 5
6 9 6
7 10 7
> # missing v2 for 1,2,9,10
> df2
v2 v1
1 3 1
2 4 2
3 5 3
4 6 4
5 7 5
6 8 6
> # some_magic_goes_here ...
> df1
v2 v1
1 1 1
2 2 2
3 3 0 # created
4 4 0 # created
5 5 0 # created
6 6 3
7 7 4
8 8 5
9 9 6
10 10 7
> df2
v2 v1
1 1 0 # created
2 2 0 # created
3 3 1
4 4 2
5 5 3
6 6 4
7 7 5
8 8 6
9 9 0 # created
10 10 0 # created
Thanks
Probably, you can do this by 1) retrieving all possible x-axis values (v2 values) from all the data.frames, and 2) using this information to retrieve existing values and/or fill missing ones with zeroes.
set.seed(111)
df1 <- data.frame(v2 = sample(1:10, size = 7),
                  v1 = sample(1:100, size = 7))
df2 <- data.frame(v2 = sample(1:10, size = 7),
                  v1 = sample(1:100, size = 7))
df3 <- data.frame(v2 = sample(1:10, size = 7),
                  v1 = sample(1:100, size = 7))
First, retrieve your categories / x-axis values / v2:
Note that if class(df1$v2) == "factor", then you should use levels() instead of unique().
my.x <- unique(c(df1$v2, df2$v2, df3$v2))
Likely, you want it sorted:
my.x <- sort(my.x)
Now, use my.x to re-order/fill your data.frames, starting with df1. Specifically, you check each value of my.x: if that value is included in df1$v2, then the corresponding v1 is returned, otherwise 0.
my.df1 <- data.frame(v2 = my.x,
                     v1 = sapply(my.x, function(i) {
                       ifelse(i %in% df1$v2, df1$v1[df1$v2 == i], 0)
                     }))
my.df1
A simple way to apply this operation to all your data.frames is to list them together and then use lapply()
dfs <- list(df1 = df1, df2 = df2, df3 = df3)
dfs <- lapply(dfs, function(df) {
  data.frame(v2 = my.x,
             v1 = sapply(my.x, function(i) {
               ifelse(i %in% df$v2, df$v1[df$v2 == i], 0)
             }))
})
# show all data.frames
dfs
# show df1
dfs$df1
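For the padding step itself, merge() against the full set of x values is another option; a sketch assuming the df1/df2/df3 and my.x objects defined above (pad and dfs_padded are just illustrative names):
pad <- function(df) {
  out <- merge(data.frame(v2 = my.x), df, by = "v2", all.x = TRUE)  # keep every x value
  out$v1[is.na(out$v1)] <- 0                                        # x values absent from df get a zero count
  out
}
dfs_padded <- lapply(list(df1 = df1, df2 = df2, df3 = df3), pad)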

Subtracting the smaller column from the greater column in a dataframe in R

I have the input below and I would like to subtract the two columns, but I want to always subtract the lowest value from the highest value,
because I don't want negative values as a result, and sometimes the highest value is in the first column (PaternalOrigin) and other times in the second column (MaternalOrigin).
Input:
df <- PaternalOrigin MaternalOrigin
16 20
3 6
11 0
1 3
1 4
3 11
and the dput output is this:
df <- structure(list(PaternalOrigin = c(16, 3, 11, 1, 1, 3), MaternalOrigin = c(20, 6, 0, 3, 4, 11)), colnames = c("PaternalOrigin", "MaternalOrigin"), row.names= c(NA, -6L), class="data.frame")
Thus, my expected output would look like:
df2 <- PaternalOrigin MaternalOrigin Results
16 20 4
3 6 3
11 0 11
1 3 2
1 4 3
3 11 8
Please, can someone advise me?
Thanks.
We can wrap the difference with abs():
transform(df, Results = abs(PaternalOrigin - MaternalOrigin))
# PaternalOrigin MaternalOrigin Results
#1 16 20 4
#2 3 6 3
#3 11 0 11
#4 1 3 2
#5 1 4 3
#6 3 11 8
Or we can assign it to 'Results'
df$Results <- with(df, abs(PaternalOrigin - MaternalOrigin))
Or using data.table
library(data.table)
setDT(df)[, Results := abs(PaternalOrigin - MaternalOrigin)]
Or with dplyr
library(dplyr)
df %>%
mutate(Results = abs(PaternalOrigin - MaternalOrigin))
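Reading the requirement literally ("always subtract the lowest value from the highest"), pmax() and pmin() express the same idea; a small sketch on the same df:
df$Results <- with(df, pmax(PaternalOrigin, MaternalOrigin) - pmin(PaternalOrigin, MaternalOrigin))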

How to add the results of lapply or for loop to specific rows

I have a database with 5 variables (columns). I want to subset the data frame, in order to check if a certain value of one specific column is present. If this is the case, then assign 1, else 0. Then paste the result (1 or 0) to a specific column of the restricted data frame, and then continue.
The data frame looks like the following:
## Year Month Product Supermarket Price
## 2015 1 67 1 10
## 2015 1 65 1 11
## 2015 1 69 1 15
## 2015 2 65 2 20
## 2015 2 67 2 25
## 2015 2 67 3 15
## 2015 2 69 3 12
Now I want to restrict to each Year, Month and Supermarket combination and check whether Product = 65 is present. If it is, assign 1 to the restricted rows in a new variable (column). If not, assign 0.
I have tried using lapply:
prueba <- function(x)
  ifelse(any(base$Product == 65), 1, 0)
lapply(unique(base$Supermarket) & unique(base$Year) & unique(base$Month),
       base$NewVar <- prueba)
but get the following result:
Error in rep(value, length.out = nrows) :
attempt to replicate an object of type 'closure'
Next, I try to make a for loop:
for (i in unique(base$Supermarket)) {
  for (j in unique(base$Year)) {
    for (h in unique(base$Month)) {
      try <- ifelse(any((filter(base, Supermarket == i, Year == j, Month == h))$Product == 65), 1, 0)
      base[base$Supermarket == i && base$Year == j && base$Month == h, ]$NewVar <- try
    }
  }
}
And get the following result:
Error in if (nrow(try) == 0) { : argument has zero lenght
I should say that the database has 50 million rows, so speed is an issue here (that is why I try to use lapply instead of a for loop).
I do not know how to obtain the proper result, which should look like the following:
## Year Month Product Supermarket Price NewVar
## 2015 1 67 1 10 1
## 2015 1 65 1 11 1
## 2015 1 69 1 15 1
## 2015 2 65 2 20 1
## 2015 2 67 2 25 1
## 2015 2 67 3 15 0
## 2015 2 69 3 12 0
I do not know how to solve the whole problem. When using lapply I get the "right" answer, but then I cannot paste the result into the right rows of the data frame.
Thanks in advance.
For fast operation, try data.table or dplyr. With data.table, you can simply create the new variable with a logical check grouped by the Year, Month and Supermarket variables (suppose your original data frame is called df):
library(data.table)
setDT(df)[, NewVar := as.numeric(65 %in% Product), .(Year, Month, Supermarket)]
df
# Year Month Product Supermarket Price NewVar
# 1: 2015 1 67 1 10 1
# 2: 2015 1 65 1 11 1
# 3: 2015 1 69 1 15 1
# 4: 2015 2 65 2 20 1
# 5: 2015 2 67 2 25 1
# 6: 2015 2 67 3 15 0
# 7: 2015 2 69 3 12 0
Or correspondingly using dplyr:
library(dplyr)
df <- df %>%
  group_by(Year, Month, Supermarket) %>%
  mutate(NewVar = as.numeric(65 %in% Product))
## read data
base <- c(2015, 1, 67, 1, 10,
          2015, 1, 65, 1, 11,
          2015, 1, 69, 1, 15,
          2015, 2, 65, 2, 20,
          2015, 2, 67, 2, 25,
          2015, 2, 67, 3, 15,
          2015, 2, 69, 3, 12)
base <- data.frame(matrix(base, 7, byrow = TRUE))
names(base) <- c('Year', 'Month', 'Product', 'Supermarket', 'Price')
I made a couple of changes to the function: I changed the object to match the input (x) and specified the third element (since the column of interest is column 3).
## create function
prueba <- function(x) ifelse(x[3] == 65, 1, 0)
To apply this function to each row, use the apply() function with MARGIN = 1 (for rows): apply(x, 1, FUN).
base$new_var <- apply(base, 1, prueba)
base
## Year Month Product Supermarket Price new_var
## 1 2015 1 67 1 10 0
## 2 2015 1 65 1 11 1
## 3 2015 1 69 1 15 0
## 4 2015 2 65 2 20 1
## 5 2015 2 67 2 25 0
## 6 2015 2 67 3 15 0
## 7 2015 2 69 3 12 0
You could also create a new variable and conditionally assign '1' to the relevant rows. This is the way I'd do it:
base$new_var <- 0
base$new_var[base$Product == 65] <- 1
base
## Year Month Product Supermarket Price new_var
## 1 2015 1 67 1 10 0
## 2 2015 1 65 1 11 1
## 3 2015 1 69 1 15 0
## 4 2015 2 65 2 20 1
## 5 2015 2 67 2 25 0
## 6 2015 2 67 3 15 0
## 7 2015 2 69 3 12 0
We can do this easily in base R:
df1$NewVar <- with(df1, ave(Product, Year, Month, Supermarket,
                            FUN = function(x) 65 %in% x))
df1$NewVar
#[1] 1 1 1 1 1 0 0

R - Counting the number of a specific value in bins

I have a data frame (df) like below:
Value <- c(1,1,0,2,1,3,4,0,0,1,2,0,3,0,4,5,2,3,0,6)
Sl <- c(1:20)
df <- data.frame(Sl,Value)
> df
Sl Value
1 1 1
2 2 1
3 3 0
4 4 2
5 5 1
6 6 3
7 7 4
8 8 0
9 9 0
10 10 1
11 11 2
12 12 0
13 13 3
14 14 0
15 15 4
16 16 5
17 17 2
18 18 3
19 19 0
20 20 6
I would like to create 4 bins out of df and count the occurrences of Value=0 grouped by Sl values in a separate data frame like below:
Bin Count
1 1
2 2
3 2
4 1
I was trying to use table and cut to create the desired data frame, but it's not clear how I should specify df$Value and the logic to find the 0s here:
df.4.cut <- as.data.frame(table(cut(df$Sl, breaks=seq(1,20, by=5))))
Using your df
tapply(df$Value, cut(df$Sl, 4), function(x) sum(x==0))
gives
> tapply(df$Value, cut(df$Sl, 4), function(x) sum(x==0))
(0.981,5.75] (5.75,10.5] (10.5,15.2] (15.2,20]
1 2 2 1
In cut you can specify the number of breaks or the breaks themselves if you prefer, and the logic lives in the function definition passed to tapply.
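For instance, passing explicit breaks instead of a bin count makes the bins fall exactly on multiples of 5 (a small variation on the line above):
tapply(df$Value, cut(df$Sl, breaks = seq(0, 20, 5)), function(x) sum(x == 0))
#  (0,5]  (5,10] (10,15] (15,20]
#      1       2       2       1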
Or using data.table: we convert the 'data.frame' to a 'data.table' (setDT(df)) and, using the cut output as the grouping variable, get the count of 'Value' entries that are '0' (!Value). By negating (!), the column is converted to a logical vector, i.e. TRUE for 0 and FALSE for all other values.
library(data.table)
setDT(df)[, sum(!Value), .(gr = cut(Sl, breaks = seq(0, 20, 5)))]
# gr V1
#1: (0,5] 1
#2: (5,10] 2
#3: (10,15] 2
#4: (15,20] 1
Your question used table(), but it lacked a second argument, which is needed to produce a contingency table. You can find the counts for each bin with:
table(cut(df$Sl,4),df$Value)
0 1 2 3 4 5 6
(0.981,5.75] 1 3 1 0 0 0 0
(5.75,10.5] 2 1 0 1 1 0 0
(10.5,15.2] 2 0 1 1 1 0 0
(15.2,20] 1 0 1 1 0 1 1
And the count of Value == 0 for each bin :
table(cut(df$Sl,4),df$Value)[,"0"]
(0.981,5.75] (5.75,10.5] (10.5,15.2] (15.2,20]
1 2 2 1
A more convoluted way using sqldf:
First we create a table defining the bins and ranges (min and max):
bins <- data.frame(id = c(1, 2, 3, 4),
                   bins = c("(0,5]", "(5,10]", "(10,15]", "(15,20]"),
                   min = c(0, 6, 11, 16),
                   max = c(5, 10, 15, 20))
id bins min max
1 1 (0,5] 0 5
2 2 (5,10] 6 10
3 3 (10,15] 11 15
4 4 (15,20] 16 20
Then we use the following query, joining both tables, to bin each Sl into its respective group using BETWEEN, keeping only rows where Value equals 0.
library(sqldf)
sqldf("SELECT bins, COUNT(Value) AS freq FROM df, bins
WHERE (((sl) BETWEEN [min] AND [max]) AND Value = 0)
GROUP BY bins
ORDER BY id")
Output:
bins freq
1 (0,5] 1
2 (5,10] 2
3 (10,15] 2
4 (15,20] 1
Another alternative, suggested by mts, to simplify the construction of bins is to use cut and extract the levels of the factor:
bins <- data.frame(id = 1:4,
                   bins = levels(cut(Sl, breaks = seq(0, 20, 5))),
                   min = seq(1, 20, 5),
                   max = seq(5, 20, 5))

Remove rows based on subtraction results

I have a large data set like this:
df <- data.frame(group = c(rep(1, 6), rep(5, 6)), score = c(30, 10, 22, 44, 6, 5, 20, 35, 2, 60, 14,5))
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 1 6
6 1 5
7 5 20
8 5 35
9 5 2
10 5 60
11 5 14
12 5 5
...
I want to take the difference between each pair of neighboring scores within each group; if the difference is greater than 30, remove the smaller score. For example, within group 1: 30-10=20<30, 10-22=-12<30, 22-44=-22<30, 44-6=38>30 (remove 6), 44-5=39>30 (remove 5)... The expected output should look like this:
group score
1 1 30
2 1 10
3 1 22
4 1 44
5 5 20
6 5 35
7 5 60
...
Does anyone have an idea how to achieve this?
Like this?
repeat {
  df$diff <- unlist(by(df$score, df$group, function(x) c(0, -diff(x))))
  if (all(df$diff < 30)) break
  df <- df[df$diff < 30, ]
}
df$diff <- NULL
df
# group score
# 1 1 30
# 2 1 10
# 3 1 22
# 4 1 44
# 7 5 20
# 8 5 35
# 10 5 60
This seems to require an iterative approach, because the "neighboring score" changes after the removal of a row. So before you remove 6, the difference 44 - 6 > 30, but 6 - 5 < 30; after you remove 6, the difference 44 - 5 > 30.
So this calculates the difference between successive rows by group (using by(...) and diff(...)), removes the appropriate rows, and then repeats the process until all differences are < 30.
It's not elegant but it should work:
out = data.frame(group = numeric(), score = numeric())
# cycle through the groups
for (g in levels(as.factor(df$group))) {
  temp = subset(df, df$group == g)
  # now go through the scores
  left = temp$score[1]
  for (s in seq(2, length(temp$score))) {
    if (left - temp$score[s] > 30) {  # test the condition
      temp$score[s] = NA
    } else {
      left = temp$score[s]  # condition not met, so update the reference score
    }
  }
  # add only the rows without NAs to the output
  out = rbind(out, temp[which(!is.na(temp$score)), ])
}
There should be a way to do this using ave, but carrying the last value forward when removing the next one if the diff > 30 is tricky! I'd appreciate a more elegant solution if there is one.
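One way to carry the last kept value without an explicit loop is Reduce() with accumulate = TRUE inside ave(); a sketch of that idea, assuming df as defined in the question (last_kept is just an illustrative name):
last_kept <- ave(df$score, df$group, FUN = function(x)
  Reduce(function(prev, cur) if (prev - cur > 30) prev else cur, x, accumulate = TRUE))
df[df$score == last_kept, ]   # a row is kept only if it updated the running reference score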
You can try
df
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 5 1 6
## 6 1 5
## 7 5 20
## 8 5 35
## 9 5 2
## 10 5 60
## 11 5 14
## 12 5 5
tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(F, -diff(x) > 30), simplify = T)), ]
while (!identical(df, tmp)) {
  df <- tmp
  tmp <- df[!unlist(tapply(df$score, df$group, FUN = function(x) c(F, -diff(x) > 30), simplify = T)), ]
}
tmp
## group score
## 1 1 30
## 2 1 10
## 3 1 22
## 4 1 44
## 7 5 20
## 8 5 35
## 10 5 60
