Calculation within a pipe between different rows of a data frame - r

I have a tibble with a column of different numbers. I wish to calculate for every one of them how many others before them are within a certain range.
For example, let's say that range is 200; in the tibble below, the result for the 5th number would be 2, that is, the cardinality of the set {816, 705}, whose numbers are above 872-1-200 = 671 but below 872.
I have thought of something along these lines:
for every row theRow of the tibble, compute between(X, Y) over the vector theTibble$number_list;
then sum the returned boolean vector.
I have been told that using loops is less efficient.
Is there a clean way to do this within a pipe without using loops?

Not the way you asked for it, but you can use a bit of linear algebra. It should be more efficient and simpler than a loop.
number_list <- c(248,650,705,816,872,991,1156,1157,1180,1277)
m <- matrix(number_list, nrow = length(number_list), ncol = length(number_list))
d <- (t(m) - number_list)
cutoff <- 200
# I used setNames to name the result, but you do not need to
# We count inclusive of 0 in case of ties
setNames(colSums(d >= 0 & d < cutoff) - 1, number_list)
Which gives you the following named vector.
248 650 705 816 872 991 1156 1157 1180 1277
0 0 1 2 2 2 1 2 3 3
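A variant of the same idea, as a minimal sketch: the question asks only about numbers that come before each row, so the version below builds the difference matrix with outer() and masks out every pair where the other number does not come earlier. The names diffs, earlier and counts are illustrative, and the sketch assumes the same strict "< cutoff" comparison as the code above.

```r
number_list <- c(248, 650, 705, 816, 872, 991, 1156, 1157, 1180, 1277)
cutoff <- 200

# diffs[i, j] = number_list[j] - number_list[i]
diffs <- outer(number_list, number_list, function(a, b) b - a)

# TRUE only where row i comes strictly before column j, so a value is
# never compared with itself and no "- 1" correction is needed
earlier <- row(diffs) < col(diffs)

# for each column j, count earlier values within [0, cutoff) below it
counts <- colSums(earlier & diffs >= 0 & diffs < cutoff)
counts
```

For sorted input this agrees with the named vector shown above; for unsorted input it counts only earlier rows, which is an assumption about what "before" means in the question.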
Here is another way that is pipe-able using rollapply().
library(dplyr)  # provides %>% and mutate()
library(zoo)    # provides rollapply()
cutoff <- 200
df %>%
  mutate(count = rollapply(number_list,
                           width = seq_along(number_list),
                           function(x) sum((tail(x, 1) - head(x, -1)) <= cutoff),
                           align = "right"))
Which gives you another column.
# A tibble: 10 x 2
   number_list count
         <int> <int>
 1         248     0
 2         650     0
 3         705     1
 4         816     2
 5         872     2
 6         991     2
 7        1156     1
 8        1157     2
 9        1180     3
10        1277     3
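If adding zoo as a dependency is not wanted, the same column can be produced by a small helper, sketched here in base R. count_within() is a hypothetical name, not a function from any package. It still iterates internally through vapply(), so it is a readability option rather than a speed-up.

```r
# count_within() is a hypothetical helper: for each element of x, count
# the earlier elements that lie at most `cutoff` below it.
count_within <- function(x, cutoff) {
  vapply(seq_along(x), function(i) {
    prev <- x[seq_len(i - 1)]          # empty for the first element
    gap <- x[i] - prev
    sum(gap >= 0 & gap <= cutoff)      # sum of an empty logical is 0L
  }, integer(1))
}

df <- data.frame(
  number_list = c(248, 650, 705, 816, 872, 991, 1156, 1157, 1180, 1277)
)
df <- transform(df, count = count_within(number_list, 200))
df
```

Inside a dplyr pipe the equivalent call would be mutate(count = count_within(number_list, cutoff)).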

Related

how to find which rows are related by mathematical difference of x in R

I have a data frame with about 20k IDs of chemical compounds and the corresponding molecular weights, something like this:
ID <- c(1,2,3,4,5)
MASS <- c(324,162,508,675,670)
d <- data.frame(ID, MASS)
ID MASS
1 1 324
2 2 162
3 3 508
4 4 675
5 5 670
I would like to find a way to loop over the rows of the column MASS to find which masses are related by a difference (positive or negative) of 162 ± 0.5. Then I would like a new column (d$DIFF) that reports the IDs linked by a MASS difference of 162 ± 0.5, with 0 for IDs where the condition is not met. In this example it would be something like this:
ID MASS DIFF
1 1 324 1&2
2 2 162 1&2
3 3 508 3&5
4 4 675 0
5 5 670 3&5
Thanks in advance for any help
Here's a base R solution using outer:
d$DIFF <- unlist(lapply(
  apply(outer(d$MASS, d$MASS,
              function(x, y) abs((abs(x - y) - 162)) < 0.5), 1, which),
  function(x) if(length(x) == 0)
    return("0")
  else
    return(paste(x, collapse = " & "))))
This gives the result:
d
#> ID MASS DIFF
#> 1 1 324 2
#> 2 2 162 1
#> 3 3 508 5
#> 4 4 675 0
#> 5 5 670 3
Note that in your example data, there is at most a single match to other rows, but if you apply this technique to your real data you should get multiple hits for some rows separated by "&" as requested.
You should also note that whatever way you do this in your real data, you will have to make approximately 20K * 20K (400 million) comparisons, so it may take some time to complete, and may result in memory issues depending on your set-up.
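If memory is the main concern, one option is to trade the full 20K x 20K matrix for a row-by-row scan, sketched below. It still makes the same number of comparisons, but only holds one row of them at a time. The names target and tol are illustrative, and the strict "< 0.5" test is taken from the code above.

```r
d <- data.frame(ID = 1:5, MASS = c(324, 162, 508, 675, 670))

target <- 162   # mass difference of interest
tol    <- 0.5   # allowed deviation from the target

# One row at a time: compare MASS[i] against the whole column, so at most
# nrow(d) comparisons are held in memory at any moment.
d$DIFF <- vapply(seq_len(nrow(d)), function(i) {
  hits <- which(abs(abs(d$MASS - d$MASS[i]) - target) < tol)
  if (length(hits) == 0) "0" else paste(d$ID[hits], collapse = " & ")
}, character(1))

d
```

Because vapply() always returns a character vector, this also sidesteps the case where apply() returns a matrix instead of a list when every row happens to have the same number of matches.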

Copy a subset of a column, based on conditions, to another dataframe in R

I have very limited R skills, and after hours searching for a solution I could not see an option that would work.
I have several large data tables. From each one, I would like to copy part of a column into a data frame, to populate a column there.
My data tables (tabn1, tabn2, tabn3) all have the same format, but with different lengths. Each subset will have a different number of rows. I would want empty spaces to be filled with NA. I can't even copy the first column, so the subsequent are the next problem!
Ro Co Red Green Yellow
1 3 123 999 265
1 3 223 875 5877
1 4 21488 555 478
1 4 558 23698 5558
2 3 558 559 148
2 3 4579 557 59
2 4 1489 545 2369
2 4 123 999 265
3 3 558 559 148
3 3 558 23698 5558
3 4 4579 557 59
3 4 1478 4579 557
4 3 1488 555 478
4 3 1478 2945 5889
4 4 448 259 4548
4 4 26576 158 15
My new data frame col names:
cls <- c("n1","n2","n3")
I created a dataframe with the column names:
df <- setNames(data.frame(matrix(ncol=3)),cls)
For each of my tables, I want to subset Ro >= 3, Co == 3, column "Red" only.
I have tried:
sub1 <- filter(tabn1, tabn1$Ro >= 3 | tabn1$Co == 3)
df$n1 <- sub1$Red
> Error in `$<-.data.frame`(`*tmp*`, n1, value = c(183.94, 180.884, :
replacement has 32292 rows, data has 1
Also:
df$n1 <- cut(sub1$Red)
> Error in cut.default(sub1$Red) :
argument "breaks" is missing, with no default
I tried using df as a datatable instead of dataframe, but also got the following errors:
df <- setNames(data.table(matrix(ncol=3)),cls)
df$n1 <- sub1$Red
> Error in set(x, j = name, value = value) :
Supplied 32292 items to be assigned to 1 items of column 'nn1'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
I would subsequently try to subset and copy from tabn2 to df$n2, and so forth. As indicated above, the original tables have different lengths.
Thanks in advance!
The issue is that the number of rows in 'df' and 'sub1' are different. 'df' is created with 1 row. Instead, we can create the 'df' directly from the 'sub1' itself
df <- sub1['Red']
names(df) <- cls[1]
Also, another way to create the data.frame, would be to specify the nrow as well
df <- as.data.frame(matrix(nrow = nrow(sub1), ncol = length(cls),
                           dimnames = list(NULL, cls)))
Regarding the second error with cut, it needs breaks. Either we specify the number of breaks
cut(sub1$Red, breaks = 3)
Or a vector of break points
cut(sub1$Red, breaks = c(-Inf, 100, 500, 1000, Inf))
If there are many 'tabn' objects, get them into a list, loop over the list with lapply
lst1 <- mget(ls(pattern = '^tabn\\d+$'))
out_lst <- lapply(lst1, function(x) subset(x, Ro >=3 | Co == 3)$Red)
It is possible that after subsetting and selecting the 'Red' column, the number of elements differs between tables. If the lengths are different, an option is to pad NA at the end of the shorter vectors before cbinding them
mx <- max(lengths(out_lst))
df <- do.call(cbind, lapply(out_lst, `length<-`, mx))
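The padding step can be checked end to end with a small sketch. tabn1 below reuses the Ro, Co and Red values from the question; tabn2 is a made-up shorter table (the first 12 rows of tabn1) so that the lengths differ. The sketch assumes both conditions must hold, so it uses & where the code above uses |.

```r
# tabn1: Ro, Co and Red columns copied from the question's example table
tabn1 <- data.frame(
  Ro  = rep(1:4, each = 4),
  Co  = rep(c(3, 3, 4, 4), times = 4),
  Red = c(123, 223, 21488, 558, 558, 4579, 1489, 123,
          558, 558, 4579, 1478, 1488, 1478, 448, 26576)
)
# tabn2: hypothetical second table, shorter on purpose
tabn2 <- tabn1[1:12, ]

tabs <- list(n1 = tabn1, n2 = tabn2)

# Assumption: rows must satisfy Ro >= 3 AND Co == 3
reds <- lapply(tabs, function(x) x$Red[x$Ro >= 3 & x$Co == 3])

# Extend every vector to the longest length; `length<-` pads with NA
mx <- max(lengths(reds))
df <- as.data.frame(lapply(reds, `length<-`, mx))
df
```

Using as.data.frame() on the padded list gives a data frame directly, whereas do.call(cbind, ...) gives a matrix.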

How to count number of occurence in a large dataset [duplicate]

I'm trying to count the number of occurrences of each "scenario" that I have (0 to 9) in a data frame over 25 years.
Basically, I have 10000 simulations of scenarios named 0 to 9, each scenario having a probability of occurrence.
My dataframe is too big to paste in here but here's a preview:
simulation=as.data.frame(replicate(10000,sample(c(0:9),size=25,replace=TRUE,prob=prob)))
simulation2=transpose(simulation)
Note: prob is a vector with the probability of observing each scenario
v1 v2 v3 v4 v5 v6 ... v25
1 0 0 4 0 2 0 9
2 1 0 0 2 3 0 6
3 0 4 6 2 0 0 0
4
...
10000
This is what I have tried so far:
for (i in c(1:25)){
for (j in c(0:9)){
f=sum(simulation2[,i]==j);
vect_f=c(vect_f,f)
}
vect_f=as.data.frame(vect_f)
}
If I omit the "for (i in c(1:25))", this returns me the right first column of the output desired. Now I am trying to replicate this over 25 years. When I put the second 'for' I do not get the output desired.
The output should look like this :
(Year) 1 2 3 4 5 6 ... 25
(Scenario)
0 649
1 239
...
9 11
649 being the number of times 'scenario 0' is observed the first year over my 10 000 simulations.
Thanks for your help
We can use table
sapply(simulation2, table)
# V1 V2 V3 V4 V5 .....
#0 1023 1050 994 1016 1022 .....
#1 1050 968 950 1001 981 .....
#2 997 969 1004 999 949 .....
#3 1031 977 1001 993 1009 .....
#4 1017 1054 1020 1003 985 .....
#......
If there are certain values missing in a column we can convert the numbers to factor including all levels
sapply(simulation2, function(x) table(factor(x, levels = 0:9)))
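A small sketch of why the factor() step matters: with fixed levels, every column yields a table of the same length, so sapply() can always assemble a matrix with one row per scenario, including scenarios that never occur in a column. The data frame sim is a made-up miniature (6 simulations, 3 years), not the question's data.

```r
set.seed(1)
# sim: hypothetical miniature of simulation2 (rows = simulations, columns = years)
sim <- as.data.frame(matrix(sample(0:9, 6 * 3, replace = TRUE), nrow = 6))

# Fixing the levels forces a count (possibly 0) for every scenario 0..9
counts <- sapply(sim, function(x) table(factor(x, levels = 0:9)))

dim(counts)      # one row per scenario, one column per year
colSums(counts)  # each year's counts add up to the number of simulations
```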
The base R answer from Ronak works well, but I think he meant to use simulation instead of simulation2.
sapply(simulation, function(x) table(factor(x, levels = 0:9)))
I tried to do the same thing using dplyr, since I find the tidyverse code more readable.
simulation %>%
rownames_to_column("i") %>%
gather(year, scenario, -i) %>%
count(year, scenario) %>%
spread(year, n, fill = 0)
However, do note that this last option is a bit slower than the base-R code (about twice as slow on my machine using your 10 000 row example).

Creating a table of results over multiple variables in R

I am using a large dataset that contains multiple variables that contain similar information. The variables range from PR1 through PR25. Each contains information regarding a procedure code. In short, the dataframe looks like this:
Obs PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
Where PR1 through PR25 values are factors.
I am looking for a way to make a table of information across all of these variables. For instance, I would like to make a table that shows a count of total number of value "527" for PR1:PR25. I would like to do this for multiple values of interest.
For instance
PR Tot
#222 3
#341 3
#527 2
#569 3
#1600 1
#1660 1
However, I only want to retrieve the frequency for a very specific set of values such as only extracting the frequency of 527 or 1600.
I have initially tried using a simple function like length(which(PR1=="527")), which works but is tedious.
I used the method suggested by Soren using:
library(plyr)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c("527", "5251", "5252", "5253", "5259",
"526", "521", "529", "8512", "8521", "344", "854", "8523", "8541", "8546",
"8542", "8547" , "8544", "8545", "8543", "639",
"064","065","063","0650","0651", "0652", "062", "066", "4040", "4041",
"4042", "0721", "0712","0701", "0702", "070", "0741", "435","436", "4399",
"439", "438", "437", "4381", "4391", "4342", "5122", "5121", "5124", "5123",
"518", "519", "503", "5022", "5012")),]
And got the following output (abbreviated):
codes count
92 062 5
95 064 8
96 0650 2
769 526 8
770 527 8
However, I had a feeling that was incorrect. When I checked it against the output from sapply(df, function(PR1) length(which(PR1 == "527")))
I get the following:
PR1 PR2 PR3 PR4 PR5 PR6 PR7 PR8 ...
1152 36 6 1 2 1 1 1
Which is the correct number of "527" cases in the dataframe. Any suggestions why the first method is giving incorrect sums of factor levels?
Thanks for any help, and let me know if I can provide more info
You can use the sapply() or lapply() function to get the count of a given value in each column.
Create data frame df
df <- data.frame(A = 1:4, B = c(4,4,4,4), C = c(2,3,4,4), D = 9:12)
df
# A B C D
# 1 1 4 2 9
# 2 2 4 3 10
# 3 3 4 4 11
# 4 4 4 4 12
Frequency of value "4" in each column A, B, C, and D using sapply() function
sapply(df, function(x) length(which(x == 4)))
A B C D
1 4 2 0
Frequency of value "4" in each column A, B, C, and D using lapply() function
lapply(df, function(x) length(which(x == 4)))
# $A
# [1] 1
# $B
# [1] 4
# $C
# [1] 2
# $D
# [1] 0
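As a compact alternative sketch, comparing the whole data frame at once gives a logical matrix, which colSums() and sum() can reduce directly. The names per_column, values and totals are illustrative; the second part totals each value of interest across all columns, which is closer to the table the question asks for.

```r
df <- data.frame(A = 1:4, B = c(4, 4, 4, 4), C = c(2, 3, 4, 4), D = 9:12)

# df == 4 is a logical matrix; column sums give the per-column frequency
per_column <- colSums(df == 4)
per_column

# Total frequency across all columns for a chosen set of values
values <- c(2, 4)
totals <- setNames(sapply(values, function(v) sum(df == v)), values)
totals
```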
The following takes your example and returns an output that may be generalized across all 25 columns. The "plyr" library is used to create the aggregated counts
Scripted as follows:
library(plyr)
df <- data.frame(PR1=c("527","1600","341","222","569"),PR2=c("1422","527","222","569","341"),PR3=c("222","569","341","1422","1660"),stringsAsFactors = T)
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result[which(result$codes %in% c('527','222')),]
Explained as follows:
Create the data frame as specified above. As OP noted values are factors, stringsAsFactors is set to TRUE
df <- data.frame(
PR1=c("527","1600","341","222","569"),
PR2=c("1422","527","222","569","341"),
PR3=c("222","569","341","1422","1660"),
stringsAsFactors = T)
Reviewing results of df
df
PR1 PR2 PR3
1 527 1422 222
2 1600 527 569
3 341 222 341
4 222 569 1422
5 569 341 1660
As OP asks to combine all the codes across PR1:PR25, these are unified into a single list by using lapply to loop across all the columns. However, as these are factors, and it seems that the interest is in the level value of the factor and not its underlying numeric representation, lapply(df,levels) returns these values. To merge PR1:PR25 into a single list, simply use unlist(), and since the column names are seemingly not useful in this case, use.names is set to FALSE. Finally, a data.frame is created with a single column called codes, which is later fed into the ddply() function to get the counts.
all_codes <- data.frame(codes=unlist(lapply(df,levels),use.names=F))
all_codes
codes
1 1600
2 222
3 341
4 527
5 569
6 1422
7 222
8 341
9 527
10 569
11 1422
12 1660
13 222
14 341
15 569
Using ddply() to split() the data.frame on the all_codes$codes value and then take the length() of each vector returned by the split in ddply()
result <- ddply(all_codes,.(codes),summarize,count=length(codes))
result
Reviewing the result gives the PR1:PR25 aggregated count of all the level values of each factor in the original data.frame
codes count
1 1422 2
2 1600 1
3 1660 1
4 222 3
5 341 3
6 527 2
7 569 3
And since we're only interested in specific values (527 is given in OP, but here two values of interest, 527 and 222, are shown):
result[which(result$codes %in% c('527','222')),]
codes count
4 222 3
6 527 2
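One caveat worth checking: levels() returns each column's distinct values, so the counts above say how many columns contain a code, not how many rows do. In the five-row example every code appears at most once per column, so the two numbers coincide. The sketch below uses made-up data with repeated codes to show the difference, and counts occurrences by converting each column to character first. It uses base table() rather than plyr, which is a substitution on my part.

```r
# Hypothetical data in which "527" repeats within a column
df <- data.frame(PR1 = c("527", "527", "341"),
                 PR2 = c("527", "222", "222"),
                 stringsAsFactors = TRUE)

# Counting levels: how many columns have the code as a level
level_counts <- table(unlist(lapply(df, levels), use.names = FALSE))

# Counting values: how many cells actually hold the code
value_counts <- table(unlist(lapply(df, as.character), use.names = FALSE))

level_counts[["527"]]          # number of columns containing 527
value_counts[["527"]]          # number of occurrences of 527
value_counts[c("527", "222")]  # restrict to the codes of interest
```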

Find a function to return value based on condition using R

I have a table with values
KId sales_month quantity_sold
100 1 0
100 2 0
100 3 0
496 2 6
511 2 10
846 1 4
846 2 6
846 3 1
338 1 6
338 2 0
now i require output as
KId sales_month quantity_sold result
100 1 0 1
100 2 0 1
100 3 0 1
496 2 6 1
511 2 10 1
846 1 4 1
846 2 6 1
846 3 1 0
338 1 6 1
338 2 0 1
Here, the calculation has to go as follows: if the quantity sold in March (month 3) is less than 60% of the combined quantity sold in January (1) and February (2), then the result should be 1; otherwise it should be 0. I need a solution that performs this.
Thanks in advance.
If I understand correctly, your requirement is to compare the quantity sold in month t with the sum of the quantities sold in months t-1 and t-2. If so, I can suggest using the dplyr package, which offers the nice feature of grouping rows and mutating columns in your data frame.
library(dplyr)
resultData <- group_by(data, KId) %>%
  arrange(sales_month) %>%
  mutate(monthMinus1Qty = lag(quantity_sold,1), monthMinus2Qty = lag(quantity_sold, 2)) %>%
  group_by(KId, sales_month) %>%
  mutate(previous2MonthsQty = sum(monthMinus1Qty, monthMinus2Qty, na.rm = TRUE)) %>%
  mutate(result = ifelse(quantity_sold/previous2MonthsQty >= 0.6,0,1)) %>%
  select(KId,sales_month, quantity_sold, result)
The result is as below:
Adding
select(KId,sales_month, quantity_sold, result)
at the end lets us display only the columns we care about (and not all the intermediate steps).
I believe this should satisfy your requirement. NA values in the result column are due to 0/0 division or no data at all for the previous months.
Should you need to expand your calculation beyond one calendar year, you can add year column and adjust group_by() arguments appropriately.
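For readers without dplyr, the same logic can be sketched in base R. This is an illustration under assumptions, not the answer's own code: lag_by() is a hypothetical helper, and like lag() above it shifts by row position, so it assumes the rows of each KId are sorted by sales_month with no missing months.

```r
# Data copied from the question
sales <- data.frame(
  KId           = c(100, 100, 100, 496, 511, 846, 846, 846, 338, 338),
  sales_month   = c(1, 2, 3, 2, 2, 1, 2, 3, 1, 2),
  quantity_sold = c(0, 0, 0, 6, 10, 4, 6, 1, 6, 0)
)

# lag_by() is a hypothetical helper: shift x down by k rows within each group g
lag_by <- function(x, g, k) {
  ave(x, g, FUN = function(v) c(rep(NA, k), v)[seq_along(v)])
}

prev1 <- lag_by(sales$quantity_sold, sales$KId, 1)  # quantity one row earlier
prev2 <- lag_by(sales$quantity_sold, sales$KId, 2)  # quantity two rows earlier
prev_total <- rowSums(cbind(prev1, prev2), na.rm = TRUE)

# 0 when the current quantity is at least 60% of the previous total, else 1;
# 0/0 produces NaN, which ifelse() turns into NA
sales$result <- ifelse(sales$quantity_sold / prev_total >= 0.6, 0, 1)
sales
```

For KId 846 in month 3, the previous total is 4 + 6 = 10 and 1/10 is below 0.6, so this rule yields 1 for that row.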
For more information on the dplyr package, see its documentation.
