Apply a rule to calculate sum of specific - r

Hi, I have a data set like this:
Num C Pr Value Volume
111 aa Alen 111 222
111 aa Paul 100 200
222 vv Iva 444 555
222 vv John 333 444
I would like to filter the data according to Num and add a new row containing the sums of the Value and Volume columns, keeping the information in the Num and C columns but putting "Total" in the Pr column. It should look like this:
Num C Pr Value Volume
222 vv Total 777 999
Could you suggest how to do it? I would like this only for Num 222.
When I try to use the res command I end up with this result:
# Num C Pr Value Volume
1: 111 aa Alen 111 222
2: 111 aa Paul 100 200
3: 111 aa Total NA NA
4: 222 vv Iva 444 555
5: 222 vv John 333 444
6: 222 vv Total NA NA
What causes this?
The structure of my data is the following:
'data.frame': 4 obs. of 5 variables:
$ Num : Factor w/ 2 levels "111","222": 1 1 2 2
$ C : Factor w/ 2 levels "aa","vv": 1 1 2 2
$ Pr : Factor w/ 4 levels "Alen","Iva","John",..: 1 4 2 3
$ Value : Factor w/ 4 levels "100","111","333",..: 2 1 4 3
$ Volume: Factor w/ 4 levels "200","222","444",..: 2 1 4 3
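To answer the "What causes this?" part: judging from the str() output above, Value and Volume are stored as factors, and factor columns cannot be summed meaningfully, which is what produces the NA rows. A minimal sketch of the usual fix (rebuilding the factor version of the data inline for illustration):

```r
# Rebuild the problematic structure: the numeric columns were read in as factors
df1 <- data.frame(Num    = factor(c(111, 111, 222, 222)),
                  C      = factor(c("aa", "aa", "vv", "vv")),
                  Pr     = factor(c("Alen", "Paul", "Iva", "John")),
                  Value  = factor(c(111, 100, 444, 333)),
                  Volume = factor(c(222, 200, 555, 444)))
# Convert via character first: as.numeric() on a factor would return the
# internal level codes, not the original numbers
df1[c("Value", "Volume")] <- lapply(df1[c("Value", "Volume")],
                                    function(x) as.numeric(as.character(x)))
sum(df1$Value[df1$Num == 222])  # 777
```

After this conversion, the grouped sums in the answers below produce numbers instead of NA.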

We could use data.table. We convert the 'data.frame' to a 'data.table' (setDT(df1)), group by the 'Num' and 'C' columns, specify the columns to sum in .SDcols, loop over those columns with lapply to get the sums, and create the 'Pr' column. We can then rbind the original dataset with the new summarised output ('DT1') and order the result by 'Num'.
library(data.table) # v1.9.5+
DT1 <- setDT(df1)[, lapply(.SD, sum), by = .(Num, C),
                  .SDcols = Value:Volume][, Pr := 'Total'][]
rbind(df1, DT1)[order(Num)]
# Num C Pr Value Volume
#1: 111 aa Alen 111 222
#2: 111 aa Paul 100 200
#3: 111 aa Total 211 422
#4: 222 vv Iva 444 555
#5: 222 vv John 333 444
#6: 222 vv Total 777 999
This can be done using base R methods as well. We get the sums of the 'Value' and 'Volume' columns grouped by 'Num' and 'C' using the formula method of aggregate, transform the output by creating the 'Pr' column, rbind with the original dataset, and order the output ('res') by 'Num'.
res <- rbind(df1,transform(aggregate(.~Num+C, df1[-3], FUN=sum), Pr='Total'))
res[order(res$Num),]
# Num C Pr Value Volume
#1 111 aa Alen 111 222
#2 111 aa Paul 100 200
#5 111 aa Total 211 422
#3 222 vv Iva 444 555
#4 222 vv John 333 444
#6 222 vv Total 777 999
EDIT: Noticed that the OP mentioned filtering. If this is for a single 'Num', we subset the data first, then do the aggregate and transform steps.
transform(aggregate(.~Num+C, subset(df1, Num==222)[-3], FUN=sum), Pr='Total')
# Num C Value Volume Pr
#1 222 vv 777 999 Total
Or we may not need aggregate at all. After subsetting the data, we convert 'Num' to a factor, loop through the subset ('df2'), take the sum if the column is of numeric class or the first element otherwise, and wrap with data.frame.
df2 <- transform(subset(df1, Num==222), Num=factor(Num))
data.frame(c(lapply(df2[-3], function(x) if(is.numeric(x))
sum(x) else x[1]), Pr='Total'))
# Num C Value Volume Pr
#1 222 vv 777 999 Total
data
df1 <- structure(list(Num = c(111L, 111L, 222L, 222L), C = c("aa", "aa",
"vv", "vv"), Pr = c("Alen", "Paul", "Iva", "John"), Value = c(111L,
100L, 444L, 333L), Volume = c(222L, 200L, 555L, 444L)), .Names = c("Num",
"C", "Pr", "Value", "Volume"), class = "data.frame",
row.names = c(NA, -4L))

Or using dplyr:
library(dplyr)
df1 %>%
  filter(Num == 222) %>%
  summarise(Value = sum(Value),
            Volume = sum(Volume),
            Pr = 'Total',
            Num = Num[1],
            C = C[1])
# Value Volume Pr Num C
# 1 777 999 Total 222 vv
where we first filter to keep only Num == 222, and then use summarise to obtain the sums and the values for Num and C. This assumes that:
You do not want the result for each unique Num (I select one here; you could select multiple). If you need per-group results, use group_by.
There is only ever one C for every unique Num.

You can also use the dplyr package:
df1 %>%
  filter(Num == 222) %>%
  group_by(Num, C) %>%
  summarise(
    Pr = "Total"
    , Value = sum(Value)
    , Volume = sum(Volume)
  ) %>%
  rbind(df1, .)
# Num C Pr Value Volume
# 1 111 aa Alen 111 222
# 2 111 aa Paul 100 200
# 3 222 vv Iva 444 555
# 4 222 vv John 333 444
# 5 222 vv Total 777 999
If you want the total for each Num value, just comment out the filter line.
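For example, with the filter line commented out (the toy data from the data section is rebuilt inline here), a Total row is produced for every Num:

```r
library(dplyr)
# Same toy data as in the data section
df1 <- data.frame(Num = c(111L, 111L, 222L, 222L),
                  C = c("aa", "aa", "vv", "vv"),
                  Pr = c("Alen", "Paul", "Iva", "John"),
                  Value = c(111L, 100L, 444L, 333L),
                  Volume = c(222L, 200L, 555L, 444L))
res <- df1 %>%
  # filter(Num == 222) %>%   # commented out: keep all Num groups
  group_by(Num, C) %>%
  summarise(Pr = "Total", Value = sum(Value), Volume = sum(Volume)) %>%
  rbind(df1, .)
res
```

This yields the four original rows plus a Total row per (Num, C) group: 211/422 for Num 111 and 777/999 for Num 222.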

Related

How to get 3 lists with no duplicates in a random sampling? (R)

I have done the first step:
how many persons have more than 1 point
how many persons have more than 3 points
how many persons have more than 6 points
My goal:
I need to have random samples (with no duplicates of persons)
of 3 persons that have more than 1 point
of 3 persons that have more than 3 points
of 3 persons that have more than 6 points
My dataset looks like this:
id person points
201 rt99 NA
201 rt99 3
201 rt99 2
202 kt 4
202 kt NA
202 kt NA
203 rr 4
203 rr NA
203 rr NA
204 jk 2
204 jk 2
204 jk NA
322 knm3 5
322 knm3 NA
322 knm3 3
343 kll2 2
343 kll2 1
343 kll2 5
344 kll NA
344 kll 7
344 kll 1
345 nn 7
345 nn NA
490 kk 1
490 kk NA
490 kk 2
491 ww 1
491 ww 1
489 tt 1
489 tt 1
325 ll 1
325 ll 1
325 ll NA
This is what I have tried so far; here is an example of code for finding persons that have more than 1 point:
persons_filtered <- dataset %>%
  group_by(person) %>%
  dplyr::filter(sum(points, na.rm = TRUE) > 1) %>%
  distinct(person) %>%
  pull()
persons_filtered
more_than_1 <- sample(persons_filtered, size = 3)
Question:
How can I write this code better so that I end up with 3 lists of unique persons? (I need to prevent the same persons from appearing in multiple lists.)
Here's a tidyverse solution, where the sampling for the three categories of interest is done at the same time.
library(tidyverse)
dataset %>%
  # Group by person
  group_by(person) %>%
  # Get the points sum
  summarize(sum_points = sum(points, na.rm = TRUE)) %>%
  # Classify the point sums into categories defined by breaks: (0,1], (1,3], (3,6], (6,Inf]
  # Inf as the last break ensures all sums above 6 get classified as (6,Inf]
  mutate(point_class = cut(sum_points, breaks = c(0, 1, 3, 6, Inf))) %>%
  # Ungroup
  ungroup() %>%
  # Group by point class
  group_by(point_class) %>%
  # Sample 3 rows per point_class
  sample_n(size = 3) %>%
  # Eliminate the sum_points column
  select(-sum_points) %>%
  # If you need this data in lists you can nest the results in the sampled_data column
  nest(sampled_data = -point_class)
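If you specifically need three plain lists of persons rather than a nested tibble, the nested sampled_data column can be unpacked into a named list. A sketch using a hand-built stand-in for the pipeline's output (the persons shown per class follow the point sums in the question's data, but any sampled rows would work):

```r
library(tibble)
# Stand-in with the same shape as the nested pipeline result above
sampled <- tibble(point_class = c("(1,3]", "(3,6]", "(6,Inf]"),
                  sampled_data = list(tibble(person = c("kk", "ww", "tt")),
                                      tibble(person = c("rt99", "kt", "rr")),
                                      tibble(person = c("knm3", "kll2", "nn"))))
# One named list entry per class, each a character vector of persons
person_lists <- setNames(lapply(sampled$sampled_data, function(d) d$person),
                         sampled$point_class)
person_lists[["(6,Inf]"]]  # "knm3" "kll2" "nn"
```

Because each person falls into exactly one point_class, the three vectors cannot share a person.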

R: sum rows from column A until conditioned value in column B

I'm pretty new to R and can't seem to figure out how to deal with what seems to be a relatively simple problem. I want to sum the rows of the column 'DURATION' per 'TRIAL_INDEX', but only those first rows where the values of 'X_POSITION' are increasing. I only want to sum the first run within a trial where X increases.
The first rows of a simplified dataframe:
TRIAL_INDEX DURATION X_POSITION
1 1 204 314.5
2 1 172 471.6
3 1 186 570.4
4 1 670 539.5
5 1 186 503.6
6 2 134 306.8
7 2 182 503.3
8 2 806 555.7
9 2 323 490.0
So, for TRIAL_INDEX 1, only the first three values of DURATION should be added (204+172+186), as this is where X has the highest value so far (going through the dataframe row by row).
The desired output should look something like:
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
I tried to use dplyr to generate a new dataframe that can be merged with my original dataframe.
However, the code doesn't work, and I'm also not sure how to ensure it only adds the first rows per trial that have increasing values of X_POSITION.
FirstPassRT = dat %>%
  group_by(TRIAL_INDEX) %>%
  filter(dplyr::lag(dat$X_POSITION, 1) > dat$X_POSITION) %>%
  summarise(FIRST_PASS_TIME = sum(DURATION))
Any help and suggestions are greatly appreciated!
library(data.table)
dt = as.data.table(df) # or setDT to convert in place
# find the rows that will be used for summing DURATION
idx = dt[, .I[1]:.I[min(.N, which(diff(X_POSITION) < 0), na.rm = T)], by = TRIAL_INDEX]$V1
# sum the DURATION for those rows
dt[idx, time := sum(DURATION), by = TRIAL_INDEX][, time := time[1], by = TRIAL_INDEX]
dt
# TRIAL_INDEX DURATION X_POSITION time
#1: 1 204 314.5 562
#2: 1 172 471.6 562
#3: 1 186 570.4 562
#4: 1 670 539.5 562
#5: 1 186 503.6 562
#6: 2 134 306.8 1122
#7: 2 182 503.3 1122
#8: 2 806 555.7 1122
#9: 2 323 490.0 1122
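The same first-run logic can also be sketched in dplyr with cumsum(): a row belongs to the first pass only while no decrease in X_POSITION has occurred yet, so rows where X rises again after a dip are excluded (the sample data from the question is rebuilt inline):

```r
library(dplyr)
dat <- data.frame(TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                  DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
                  X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6,
                                 306.8, 503.3, 555.7, 490.0))
out <- dat %>%
  group_by(TRIAL_INDEX) %>%
  # cumsum of "a decrease happened" stays 0 only during the first increasing run
  mutate(in_first_run = cumsum(c(FALSE, diff(X_POSITION) < 0)) == 0,
         FIRST_PASS_TIME = sum(DURATION[in_first_run])) %>%
  ungroup() %>%
  select(-in_first_run)
unique(out$FIRST_PASS_TIME)  # 562 1122
```

This gives the same 562 and 1122 as above, but would also behave correctly if X_POSITION increased again later in a trial.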
Here is something you can try with the dplyr package:
library(dplyr)
dat %>%
  group_by(TRIAL_INDEX) %>%
  mutate(IncLogic = X_POSITION > lag(X_POSITION, default = 0)) %>%
  mutate(FIRST_PASS_TIME = sum(DURATION[IncLogic])) %>%
  select(-IncLogic)
Source: local data frame [9 x 4]
Groups: TRIAL_INDEX [2]
TRIAL_INDEX DURATION X_POSITION FIRST_PASS_TIME
(int) (int) (dbl) (int)
1 1 204 314.5 562
2 1 172 471.6 562
3 1 186 570.4 562
4 1 670 539.5 562
5 1 186 503.6 562
6 2 134 306.8 1122
7 2 182 503.3 1122
8 2 806 555.7 1122
9 2 323 490.0 1122
If you want to summarize it down to one row per trial you can use summarize like this:
library(dplyr)
df <- data_frame(TRIAL_INDEX = c(1, 1, 1, 1, 1, 2, 2, 2, 2),
                 DURATION = c(204, 172, 186, 670, 186, 134, 182, 806, 323),
                 X_POSITION = c(314.5, 471.6, 570.4, 539.5, 503.6,
                                306.8, 503.3, 555.7, 490.0))
res <- df %>%
  group_by(TRIAL_INDEX) %>%
  mutate(x.increasing = ifelse(X_POSITION > lag(X_POSITION), TRUE, FALSE),
         x.increasing = ifelse(is.na(x.increasing), TRUE, x.increasing)) %>%
  filter(x.increasing == TRUE) %>%
  summarize(FIRST_PASS_TIME = sum(DURATION))
res
#Source: local data frame [2 x 2]
#
#  TRIAL_INDEX FIRST_PASS_TIME
#        (dbl)           (dbl)
#1           1             562
#2           2            1122

Sorting data.frame in r [duplicate]

I am new to R, and want to sort a data frame called "weights". Here are the details:
>str(weights)
'data.frame': 57 obs. of 1 variable:
$ attr_importance: num 0.04963 0.09069 0.09819 0.00712 0.12543 ...
> names(weights)
[1] "attr_importance"
> dim(weights)
[1] 57 1
> head(weights)
attr_importance
make 0.049630556
address 0.090686474
all 0.098185517
num3d 0.007122618
our 0.125433292
over 0.075182467
I want to sort by decreasing order of attr_importance BUT I want to preserve the corresponding row names also.
I tried:
> weights[order(-weights$attr_importance),]
but it gives me a numeric vector back.
I want a data frame back - which is sorted by attr_importance and has CORRESPONDING row names intact. How can I do this?
Thanks in advance.
Since your data.frame only has one column, you need to set drop=FALSE to prevent the dimensions from being dropped:
weights[order(-weights$attr_importance), , drop = FALSE]
# attr_importance
# our 0.125433292
# all 0.098185517
# address 0.090686474
# over 0.075182467
# make 0.049630556
# num3d 0.007122618
Here is the big comparison on data.frame sorting:
How to sort a dataframe by column(s)?
Using my now-preferred solution arrange:
dd <- data.frame(b = factor(c("Hi", "Med", "Hi", "Low"),
                            levels = c("Low", "Med", "Hi"), ordered = TRUE),
                 x = c("A", "D", "A", "C"), y = c(8, 3, 9, 9),
                 z = c(1, 1, 1, 2))
library(plyr)
arrange(dd, desc(z), b)
b x y z
1 Low C 9 2
2 Med D 3 1
3 Hi A 8 1
4 Hi A 9 1
rankdata.txt
regno name total maths science social cat
1 SUKUMARAN 400 78 89 73 S
2 SHYAMALA 432 65 79 87 S
3 MANOJ 500 90 129 78 C
4 MILYPAULOSE 383 59 88 65 G
5 ANSAL 278 39 77 60 O
6 HAZEENA 273 45 55 56 O
7 MANJUSHA 374 50 99 52 C
8 BILBU 408 81 97 72 S
9 JOSEPHROBIN 374 57 85 68 G
10 SHINY 381 70 79 70 S
z <- data.frame(rankdata)
z[with(z, order(-total + maths)), ]  # order by -total + maths
z[with(z, order(name)), ]            # sort on name

Count number of occurrences of a string in R under different conditions

I have a dataframe called "data", with multiple columns, which looks like this:
Preferences Status Gender
8a 8b 9a Employed Female
10b 11c 9b Unemployed Male
11a 11c 8e Student Female
That is, each customer selected 3 preferences and specified other information such as Status and Gender. Each preference is given by a [number][letter] combination, and there are c. 30 possible preferences. The possible preferences are:
8[a - c]
9[a - k]
10[a - d]
11[a - c]
12[a - i]
I want to count the number of occurrences of each preference under certain conditions on the other columns, e.g. for all women.
The output will ideally be a dataframe that looks like this:
Preference Female Male Employed Unemployed Student
8a 1034 934 234 495 203
8b 539 239 609 394 235
8c 124 395 684 94 283
9a 120 999 895 945 345
9b 978 385 596 923 986
etc.
What's the most efficient way to achieve this?
Thanks.
I am assuming you are starting with something that looks like this:
mydf <- structure(list(
Preferences = c("8a 8b 9a", "10b 11c 9b", "11a 11c 8e"),
Status = c("Employed", "Unemployed", "Student"),
Gender = c("Female", "Male", "Female")),
.Names = c("Preferences", "Status", "Gender"),
class = c("data.frame"), row.names = c(NA, -3L))
mydf
# Preferences Status Gender
# 1 8a 8b 9a Employed Female
# 2 10b 11c 9b Unemployed Male
# 3 11a 11c 8e Student Female
If that's the case, you need to "split" the "Preferences" column (by spaces), transform the data into "long" form, and then reshape it to wide form, tabulating as you do so.
With the right tools, this is pretty straightforward.
library(devtools)
library(data.table)
library(reshape2)
source_gist(11380733) # for `cSplit`
dcast.data.table( # Step 3--aggregate to wide form
melt( # Step 2--convert to long form
cSplit(mydf, "Preferences", " ", "long"), # Step 1--split "Preferences"
id.vars = "Preferences"),
Preferences ~ value, fun.aggregate = length)
# Preferences Employed Female Male Student Unemployed
# 1: 10b 0 0 1 0 1
# 2: 11a 0 1 0 1 0
# 3: 11c 0 1 1 1 1
# 4: 8a 1 1 0 0 0
# 5: 8b 1 1 0 0 0
# 6: 8e 0 1 0 1 0
# 7: 9a 1 1 0 0 0
# 8: 9b 0 0 1 0 1
I also tried a dplyr + tidyr approach, which looks like the following:
library(dplyr)
library(tidyr)
mydf %>%
  separate(Preferences, c("P_1", "P_2", "P_3")) %>% ## splitting things
  gather(Pref, Pvals, P_1:P_3) %>%  # stack the preference columns
  gather(Var, Val, Status:Gender) %>%  # stack the status/gender columns
  group_by(Pvals, Val) %>%  # group by these new columns
  summarise(count = n()) %>%  # aggregate the numbers of each
  spread(Val, count)  # spread the values out
# Source: local data table [8 x 6]
# Groups:
#
# Pvals Employed Female Male Student Unemployed
# 1 10b NA NA 1 NA 1
# 2 11a NA 1 NA 1 NA
# 3 11c NA 1 1 1 1
# 4 8a 1 1 NA NA NA
# 5 8b 1 1 NA NA NA
# 6 8e NA 1 NA 1 NA
# 7 9a 1 1 NA NA NA
# 8 9b NA NA 1 NA 1
Both approaches are actually pretty quick. Test them with some better sample data than what you shared, like this:
preferences <- c(paste0(8, letters[1:3]),
                 paste0(9, letters[1:11]),
                 paste0(10, letters[1:4]),
                 paste0(11, letters[1:3]),
                 paste0(12, letters[1:9]))
set.seed(1)
nrow <- 10000
mydf <- data.frame(
  Preferences = vapply(replicate(nrow,
                                 sample(preferences, 3, FALSE),
                                 FALSE),
                       function(x) paste(x, collapse = " "),
                       character(1L)),
  Status = sample(c("Employed", "Unemployed", "Student"), nrow, TRUE),
  Gender = sample(c("Male", "Female"), nrow, TRUE)
)

How to split the vector into small group in R?

x<-rnorm(5000,5,3)
How can I split x into 500 groups, with ten numbers in every group?
Answer #1:
x <- rnorm(5000, 5, 3)
y <- matrix(nr = 500, nc = 10)
y[] <- x
Answer #2:
Skip the first step and just create the matrix directly.
y <- matrix(rnorm(5000, 5, 3), nr = 500, nc = 10)
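If a list is more convenient than a matrix, base split() does this directly. Note that this groups consecutive runs of ten values, whereas the matrix above is filled column-wise, so its rows interleave the vector:

```r
x <- rnorm(5000, 5, 3)
# rep(1:500, each = 10) labels each consecutive run of 10 values with a group id
groups <- split(x, rep(1:500, each = 10))
length(groups)              # 500 groups
all(lengths(groups) == 10)  # every group has ten numbers
```

Each element of groups is then a plain numeric vector, e.g. groups[["1"]] is x[1:10].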
Are you looking for something like this:
# create a vector of group labels
group <- rep(sample(1:500, replace = FALSE, size = 500), 10)
group.name <- paste("group", as.character(group), sep = " ")
# create a dataframe of groups and corresponding values
df <- data.frame(group = group.name, value = rnorm(5000, 5, 3))
# check the dataframe
str(df)
'data.frame': 5000 obs. of 2 variables:
$ group: Factor w/ 500 levels "group 1","group 10",..: 271 115 404 252 138 243 375 308 434 16 ...
$ value: num 8.55 10.14 3.71 8.79 4.17 ...
head(df)
group value
1 group 342 8.547406
2 group 201 10.135465
3 group 462 3.713305
4 group 325 8.786934
5 group 222 4.171373
6 group 317 3.478123
