R for loop anomaly when expanding the range

Assume the following dataframe:
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application,Rating)
DF
#Application Rating
#1 A 0
#2 A 0.6
#3 B 0.6
#4 B 2.0
#5 B 2.0
#6 C 3.8
#7 C 3.8
#8 D 3.9
I want to create an empty results table to be populated through a loop:
1st column - to show the rating being counted (e.g. 0.6)
2nd column - to show the number of times that rating occurs in DF
3rd column - to list total number of ratings in DF (i.e. 8)
4th column - to calculate the proportion of the applications with that rating relative to the overall
#create empty results table
results_rating_bins <- as.data.frame(matrix(nrow = 1, ncol = 4))
#initiate row count
rownr = 1
#Loop:
for (rating in seq(from = 0, to = 4.0, by = 0.1)) {
this_rating <- subset(DF, DF$Rating == rating)
results_rating_bins[rownr, 1] = rating
results_rating_bins[rownr, 2] = nrow(this_rating)
results_rating_bins[rownr, 3] = nrow(DF)
results_rating_bins[rownr, 4] = nrow(this_rating) / nrow(DF)
rownr <- rownr + 1
}
The final result is what I expect, except for rating 2.0 where the count is 0 even though it should be 2.
This illustrates, at small scale, what I see at larger scale with a 30k-line dataset. There I have a list of apps with ratings going from 0 to 4.9, so the range in my loop would be 0 to 4.9 instead of the 0 to 4.0 used in my example. However, when I run the loop on the large dataset, I end up with a number of ratings where the count is 0 even though it shouldn't be. What's even odder is that, as I play around with the range, the ratings where the anomaly (i.e. count = 0) occurs vary seemingly at random.
Any idea what may justify this type of behaviour?
Amnesty

Typically I answer the questions as asked, trying to work through the logic a question poster is already using. However, in this case, it is so much easier to use dplyr to aggregate into the new table that I am breaking with tradition.
library(dplyr)
Application <- c('A','A','B','B','B','C','C','D')
Rating <- c('0','0.6','0.6','2.0','2.0','3.8','3.8','3.9')
DF <- data.frame(Application, Rating)
df2 <- DF %>%
  group_by(Application, Rating) %>%
  summarize(ratio = n() / nrow(DF))
The first part is the same as yours, with the library call added.
Where df2 starts, you are setting the df2 data frame to a grouped version of your initial data frame, based on the combinations of Application and Rating. In the summarize statement, for each combination that occurs, we count the rows with n() and divide by the total number of rows in the original data frame, nrow(DF). This creates the third column of the new table: the proportion of the total that each pair represents.
The result looks like this. You could add a column with the total number of rows if you need it, but it is not necessary for this calculation.
Application Rating ratio
1 A 0 0.125
2 A 0.6 0.125
3 B 0.6 0.125
4 B 2.0 0.250
5 C 3.8 0.250
6 D 3.9 0.125
This will absolutely catch every combination of Application and Rating and calculate the ratio relative to the whole data frame.
EDIT: If you do not care about the Application letter, you can simply remove it from the group_by function and still get what you want.
If you want the total number of rows in the frame on each row, add
%>%
mutate(rows = nrow(DF))
to the end of the pipeline. (A second summarise call would collapse the table again and drop the ratio column.)
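As for why the original loop misses 2.0: a likely explanation, offered as a sketch rather than a confirmed diagnosis, is that Rating is stored as text. When R compares text with a number it converts the number to text first, so '2.0' is compared with "2" and never matches. On the larger dataset, floating-point rounding in the values produced by seq may be a second source of misses. The base R sketch below converts the ratings to numbers and compares with a small tolerance; the names scores, bins and counts and the tolerance 1e-8 are choices made for this illustration.

```r
Rating <- c('0', '0.6', '0.6', '2.0', '2.0', '3.8', '3.8', '3.9')

# Text vs. number: the number is converted to text first, so "2.0" meets "2"
text_match <- "2.0" == 2                  # FALSE
numeric_match <- as.numeric("2.0") == 2   # TRUE

# Convert once, then count each bin with a tolerance instead of ==
scores <- as.numeric(Rating)
bins <- seq(from = 0, to = 4.0, by = 0.1)
counts <- sapply(bins, function(b) sum(abs(scores - b) < 1e-8))

results_rating_bins <- data.frame(
  rating = bins,
  count = counts,
  total = length(scores),
  proportion = counts / length(scores)
)
```

With this version every rating lands in exactly one bin, so the counts add up to the number of rows in DF.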

Related

Adding specific column value according to row value

I have a dataframe h3
Genotype Preference
Rice 1
Rice 2
Lr 3
Lr 3
th 4
th 7
I want the dataframe to look like
Genotype Preference Haplotype
Rice 1 1
Rice 2 1
Lr 3 2
Lr 3 2
th 4 0.5
th 7 0.5
That is, I want to add a numerical variable for each type of genotype. I have around 100 observations for each type of genotype. I want to be able to add the numerical variable as a new column in a single line of code, ensuring that 1 is added for Rice, 2 for Lr and 0.5 for th.
I tried constructing the code with the mutate/ifelse:
h3 %>% select(Genotype) %>% mutate(Type = ifelse (Genotype = c("Rice"), 1, Genotype))
Other results which I looked up, provide solutions for adding a column with a calculated value from the previous columns but not specific values.
I have found "dplyr mutate with conditional values" and "Apply R-function to rows depending on value in other column", but I don't know how to adapt them for my code.
Any help with this will be greatly appreciated.
Using dplyr, you can do:
library(dplyr)
df %>% mutate(Haplotype = ifelse(Genotype == "Rice",1,ifelse(Genotype == "Lr",2,0.5)))
Using base, you can do the same thing:
df$Haplotype = ifelse(df$Genotype == "Rice",1,ifelse(df$Genotype == "Lr",2,0.5))
Data
df = data.frame("Genotype" = c("Rice","Rice","Lr","Lr","Th","Th"),
"Preference" = c(1,2,3,3,4,7))

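With more than a handful of genotypes, nested ifelse calls become hard to read. One alternative, sketched here in base R, is a named lookup vector indexed by the genotype. The name haplotype_map is made up for this illustration, and the values are the ones given in the question.

```r
h3 <- data.frame(Genotype = c("Rice", "Rice", "Lr", "Lr", "th", "th"),
                 Preference = c(1, 2, 3, 3, 4, 7))

# One entry per genotype; genotypes missing from the map would become NA
haplotype_map <- c(Rice = 1, Lr = 2, th = 0.5)

# as.character() guards against Genotype being stored as a factor
h3$Haplotype <- unname(haplotype_map[as.character(h3$Genotype)])
```

Unlike the ifelse version, an unexpected genotype shows up as NA here instead of silently falling into the last branch.
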
Complex data calculation for consecutive zeros at row level in R (lag v/s lead)

I have a complex calculation that needs to be done. It works at row level, and I am not sure how to tackle it.
If you can help me with an approach or suggest any functions, that would be really great.
I will break my problem into two sub-problems for simplicity.
Below is what my data looks like:
Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0
I have data for each Group at Monthly Level.
I would like to capture the below two things.
1. The count of consecutive zeros for each row, in both directions from lag0(reference)
The cells highlighted in yellow are the zeros that run consecutively from lag0(reference) in either direction until the first 1 is reached. I want to capture the count of these zeros at row level, along with the corresponding Sales value.
Below is the output I am looking for in part 1.
Output:
Month,Sales,Count
1,2503,9
2,3734,3
3,6631,5
4,8606,0
5,1889,6
6,4819,1
7,5120,1
2. Identify consecutive rows (rows 1, 2 and 3, and similarly rows 5 and 6) where the zeros within the lag0(reference) range overlap in any lag or lead column, and capture their Sales and Month values.
For example, for rows 1, 2 and 3, the overlap happens at least at lags 3, 2 and 1 and leads 1 and 2; this needs to be captured and tagged as Case 1 (or 1). Similarly, for rows 5 and 6 at least lag1 overlaps, so this needs to be captured and tagged as Case 2 (or 2), along with the Sales and Month values.
Row 7 does not overlap with the previous or following row, hence it will not be captured.
Below is the result I am looking for in part 2.
Month,Sales,Case
1,2503,1
2,3734,1
3,6631,1
5,1889,2
6,4819,2
I want to run this for multiple groups, so I will use either dplyr or a loop to get the result. Currently, I am simply looking for the approach.
I am not sure how to solve this problem; it is the first time I have needed to capture things at row level in R. I am not looking for a complete solution, simply a first step towards tackling it. I would appreciate any leads.
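One option for the first part of the calculation uses rle:
df$count <- apply(df[,-c(1:4)], 1, function(x){
  # x holds the 15 flag columns: lag7..lag1, lag0(reference), lead1..lead7
  first <- rle(x[1:7])    # runs in lag7..lag1
  second <- rle(x[9:15])  # runs in lead1..lead7
  count <- 0
  # zeros immediately before lag0: the last run of the lag columns
  if(first$values[length(first$values)] == 0){
    count = first$lengths[length(first$values)]
  }
  # zeros immediately after lag0: the first run of the lead columns
  if(second$values[1] == 0){
    count = count+second$lengths[1]
  }
  count
})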
df[,c("Month", "Sales", "count")]
# Month Sales count
# 1 1 2503 9
# 2 2 3734 3
# 3 3 6631 5
# 4 4 8606 0
# 5 5 1889 6
# 6 6 4819 1
# 7 7 5120 1
Data:
df <- read.table(text =
"Group,Date,Month,Sales,lag7,lag6,lag5,lag4,lag3,lag2,lag1,lag0(reference),lead1,lead2,lead3,lead4,lead5,lead6,lead7
Group1,42005,1,2503,1,1,0,0,0,0,0,0,0,0,0,0,1,0,1
Group1,42036,2,3734,1,1,1,1,1,0,0,0,0,1,1,0,0,0,0
Group1,42064,3,6631,1,0,0,1,0,0,0,0,0,0,1,1,1,1,0
Group1,42095,4,8606,0,1,0,1,1,0,1,0,1,1,1,0,0,0,0
Group1,42125,5,1889,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0
Group1,42156,6,4819,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0
Group1,42186,7,5120,0,0,1,1,1,1,1,0,0,1,1,0,1,1,0",
header = TRUE, stringsAsFactors = FALSE, sep = ",")
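The same count can also be written without rle, by measuring how far the zeros extend on each side of lag0(reference) before the first non-zero value. This is only a sketch of that idea; count_zero_run is a helper name invented here, and the two vectors are row 1 of the data above.

```r
# Number of zeros at the start of x, before the first non-zero value
count_zero_run <- function(x) {
  first_nonzero <- which(x != 0)[1]
  if (is.na(first_nonzero)) length(x) else first_nonzero - 1
}

# Row 1 of the example data: lag7..lag1 and lead1..lead7
lags  <- c(1, 1, 0, 0, 0, 0, 0)
leads <- c(0, 0, 0, 0, 1, 0, 1)

# Reverse the lags so the run is measured outwards from lag0
row1_count <- count_zero_run(rev(lags)) + count_zero_run(leads)
```

For row 1 this gives 9, matching the expected output. Because the calculation only looks within a single row, it can be applied to every group without any grouping step.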

dplyr: Can a function called inside mutate find the element of a column from current row

I have a very large data frame and a set of adjustment coefficients that I wish to apply to certain years, with each coefficient applied to one and only one year. The code below tries, for each row, to select the right coefficient, and return a vector containing dat in the unaffected years and dat times that coefficient in the selected years, which is to replace dat.
year <- rep(1:5, times = c(2,2,2,2,2))
dat <- 1:10
df <- tibble(year, dat)
adjust = c(rep(0, 4), rep(c(1 + 0.1*1:3), c(2,2,2)))
df %>% mutate(dat = ifelse(year < 5, year, dat*adjust[[year - 2]]))
When I run this, I get the following error:
Evaluation error: attempt to select more than one element in vectorIndex.
I am pretty sure this is because the extraction operator [[ treats year as the entire vector year rather than the year of the current row, so there is then a vectorized subtraction, whereupon [[ chokes on the vector-valued index.
I know there are many ways to solve this problem. I have a particularly ugly way involving nested ifelse’s working now. My question is, is there any way to do what I was trying to do in an R- and dplyr- idiomatic way? In some ways this seems like a filter or group_by problem, since we want to treat rows or groups of rows as distinct entities, but I have not found a way of doing so that is any cleaner.
It seems like there are some functions which are easier to define or to think of as row-by-row rather than as the product of entire vectors. I could produce a single vector containing the correct adjustment for each year, but since the number of rows per year varies, I would still have to apply a multi-valued conditional test to construct that vector, so the same problem arises.
Or doesn’t it?
You need to use [ instead of [[ for vector indexing. Also, year - 2 produces zero and negative indices for the early years, which will cause further problems. If you want to map year to adjust by index position, you can use replace with a mask that indicates the years to be modified:
df %>%
  mutate(dat = {
    mask = year > 2;
    replace(year, mask, dat[mask] * adjust[year[mask] - 2])
  })
# A tibble: 10 x 2
# year1 dat1
# <int> <dbl>
# 1 1 1.0
# 2 1 1.0
# 3 2 2.0
# 4 2 2.0
# 5 3 5.5
# 6 3 6.6
# 7 4 8.4
# 8 4 9.6
# 9 5 11.7
#10 5 13.0
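Another way to think about it, closer to the "single vector containing the correct adjustment" idea in the question, is to look up each row's coefficient with match. The sketch below uses base R and assumes one coefficient per adjusted year (1.1, 1.2 and 1.3 for years 3 to 5, which is consistent with the output above); coef_year, coef_value and multiplier are names made up for this illustration. Unlike the replace version, it leaves dat unchanged in the unaffected years, as the question's text describes.

```r
year <- rep(1:5, each = 2)
dat <- 1:10

# Assumed coefficients: one value for each year that gets adjusted
coef_year  <- c(3, 4, 5)
coef_value <- c(1.1, 1.2, 1.3)

# match() gives NA for years without a coefficient; treat those as 1 (no change)
multiplier <- coef_value[match(year, coef_year)]
multiplier[is.na(multiplier)] <- 1

adjusted <- dat * multiplier
```

The lookup is fully vectorized, so the number of rows per year can vary freely, and the same expression can be used inside mutate.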

dataframe math to calculate profit based on indicator

Simple explanation: My Goal is to figure out how to get the profit column shown below. I am trying to calculate the difference in val for every pair of changing values (-1 to 1 and 1 to -1).
If the starting indicator is 1 or -1, store the value.
Find the next indicator that is opposite (so -1 on row 3). Save this val. Subtract the first value from it (.85 - .84). Store that in the profit column.
Repeat.
Specific to this case
Go until you find the next opposite val (on row 4). Save this value. Subtract the values and save the result in the profit column.
Go until you find the next opposite val (on row 8). Save this value. Subtract the values and save the result in the profit column.
Financial explanation (if it is useful)
I am trying to write a function to calculate profit given a column of values and a column of indicators (buy/sell/hold). I know this is implemented in a few of the big packages (quantmod, quantstrat), but I can't seem to find a simple way to do it.
df<-
data.frame(val=c(.84,.83,.85,.83,.83,.84,.85,.81),indicator=c(1,0,-1,1,0,1,1,-1))
df
val indicator profit
1 0.84 1 NA
2 0.83 0 NA
3 0.85 -1 .01 based on: (.85-.84) from 1 one to -1
4 0.83 1 .02 based on (.85-.83) from -1 to 1
5 0.83 0 NA
6 0.84 1 NA
7 0.85 1 NA
8 0.81 -1 -.02 based on (.81-.83) from last change (row 4) to now
Notes
multiple indicators should be ignored (1111) means nothing beyond the first one which should be stored. (line 4 is stored, lines 5,6,7 are not)
ignore 0, holds do not change profit calculation
I am happy to provide more info if needed.
It was easier for me to work it out in my head after splitting the problem into two parts, which correspond to the two loops shown below. The first part involves flagging the rows where there was a change in the indicator value, while the second part involves subtracting the val's from the relevant rows (i.e., those selected in part 1). FYI, I assume you meant to put -.02 for row 4 in your example? If not, then please clarify which rows get subtracted from which when calculating profit.
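x <- data.frame(val = c(.84, .83, .85, .83, .83, .84, .85, .81),
                indicator = c(1, 0, -1, 1, 0, 1, 1, -1))
x$num <- seq_along(x$val)
x$rollingProf <- NA
# Part 1: flag the rows where the indicator flips sign.
# Assumes the first row holds the starting position (indicator = 1 here),
# so it is flagged too and row 3 has something to be compared with.
indicator <- x[1, "indicator"]
x[1, "rollingProf"] <- 1
for (i in 1:(nrow(x) - 1)) {
  next_ <- x[i + 1, "indicator"]
  if (indicator * -1 == next_) {
    x[i + 1, "rollingProf"] <- 1
    indicator <- next_
  }
}
# Part 2: for each flagged row, subtract the val of the previous flagged row
q <- x[!is.na(x$rollingProf), c("val", "num")]
for (i in 2:nrow(q)) {
  q[i, "change"] <- q[i, "val"] - q[i - 1, "val"]
}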
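The two loops can also be collapsed into a few vectorized steps. The following is a sketch of that idea in base R, not a replacement for the packages mentioned in the question; the names active, signs, keep, flips and profit are chosen for this illustration. It ignores holds, keeps only the first row of each run of identical indicators, and takes differences between consecutive kept rows.

```r
val <- c(.84, .83, .85, .83, .83, .84, .85, .81)
indicator <- c(1, 0, -1, 1, 0, 1, 1, -1)

# Ignore holds (indicator == 0)
active <- which(indicator != 0)
signs <- indicator[active]

# Keep the first row of every run of identical indicators
keep <- c(TRUE, signs[-1] != signs[-length(signs)])
flips <- active[keep]

# Profit at each flip is val there minus val at the previous flip
profit <- rep(NA_real_, length(val))
profit[flips[-1]] <- diff(val[flips])
```

Here the flips fall on rows 1, 3, 4 and 8, so profit is filled in on rows 3, 4 and 8 and stays NA elsewhere.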

How to bin data based on values in one column, and sum occurrences from another column in R?

I have a dataframe df and want to bin rows using data from column A, and then for each bin, count the number of times that a value is present in another column B. Here is an example using only 2 columns (although my real example has many columns):
A B
5.4
4.6 36_8365
2.4
3.6
0.6
8.9 83_7433
4
7.6
4.7 54_3874
1.5 54_8364
I want to look in column A and find all values less than 1, at least 1 but less than 2, and so on, and for each bin I want to count the number of times that a value appears in column B. For the table above, this would give the following results:
Class Number
<1 0
1<=A<2 1
2<=A<3 0
3<=A<4 0
4<=A<5 2
5<=A<6 0
6<=A<7 0
7<=A<8 0
8<=A<9 1
9<=A<10 0
The following is close, but it will sum the values when instead I just want to count them:
with(df, sum(df[A >= 1 & A < 2, "B"]))
I'm not sure what to replace "sum" with to get just counts, instead of a sum. I know I can identify which rows in column B have a value by using
thing <- B==''
or make a table using
thing_table <- table(B=='')
However, I'm not sure how to search through column A, test if the value is between 2 other values, and then count the items in B that meet those criteria. Can anyone point me in the right direction?
Thanks!
First:
newdf<-na.omit(df)
This will shrink the df down to only rows with data in them. Make sure the empty cells are showing up as NAs before attempting.
Second:
Replace sum with length
with(newdf, length(newdf[A >= 1 & A < 2, "B"]))
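To get every bin in one step instead of one with() call per bin, cut and table can be combined. This is a sketch under the assumption that empty cells in B are either NA or empty strings; the data frame is retyped from the question, and bins, has_b and counts are names made up for this illustration.

```r
df <- data.frame(
  A = c(5.4, 4.6, 2.4, 3.6, 0.6, 8.9, 4, 7.6, 4.7, 1.5),
  B = c("", "36_8365", "", "", "", "83_7433", "", "", "54_3874", "54_8364"),
  stringsAsFactors = FALSE
)

# right = FALSE makes each bin closed on the left: [0,1), [1,2), ..., [9,10)
bins <- cut(df$A, breaks = 0:10, right = FALSE)

# TRUE where column B actually holds a value
has_b <- !is.na(df$B) & df$B != ""

# table() keeps empty bins, so every class appears with a count of 0 or more
counts <- table(bins[has_b])
```

Because bins is a factor, bins with no matching rows still appear in counts with a value of 0, which gives the full table asked for in the question.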
