R programming Sum data frame - r

I have some code in R, and I want to sum the values from a data frame (df$number is the unlisted result in 'res').
The result I want to total is: [1] 1 3 5 7 9 20 31 42
digits <- function(x){as.integer(substring(x, seq(nchar(x)), seq(nchar(x))))}
generated <- function(x){ x + sum(digits(x))}
digitadition <- function(x,N) { c(x, replicate(N-1, x <<- generated(x))) }
res <- NULL
for(i in 0:50){
for(j in 2:50){
tmp <- digitadition(i,j)
IND <- 50*(i-1) + (j-1) - (i-1) #to index results
res[IND] <- tmp[length(tmp)]
}
}
df <- data.frame(number = unlist(res), generator=rep(1:50, each=49), N=2:50)
total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
setdiff(1:50, as.numeric(names(total)))
sum(total)
I'm using sum(total), but the result is 155, which is not the right answer; the right answer is 118.
What is the specific code to sum 'total'?
Thank you.

I ran your code and I think you may be confused about what you want to sum.
Your setdiff contains the values 1 3 5 7 9 20 31 42, whose sum is 118.
So, if you do sum(setdiff(1:50, as.numeric(names(total)))), you'll get the 118 you are looking for.
Your total variable is different from this. Let me explain what you are doing and what I think you should do.
Your code: total <- table(df$number)[as.numeric(names(table(df$number)))<=50]
When you call table(), you get each unique value from the vector, along with the number of times that value appears in it. So sum(total) adds up those counts, not the values themselves.
And when you get the names() of this table, you get each of these unique values as a character string, which is why you are converting with as.numeric.
But the function unique() does this job for you: it extracts the unique values from a vector.
Here's what you can do: total <- unique(df$number[which(df$number <= 50)])
Here which() gets the indices of the values <= 50, and unique() extracts the unique values at those indices.
And finally: sum(setdiff(1:50, total)), which sums all the values from 1 to 50 that are not in your total vector.
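And in my opinion, this is more intuitive than going through table() and names(). Keep the argument order as setdiff(1:50, total): setdiff(total, 1:50) would instead return the values of total that are not in 1:50.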
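To make the difference between summing counts and summing values concrete, here is a minimal sketch on a small made-up vector. The vector `number` is only a stand-in for df$number and the range 1:10 replaces 1:50, so none of these values come from the original data.

```r
# Toy stand-in for df$number (made-up values, not the asker's data)
number <- c(2, 4, 4, 6, 8, 8, 8, 10, 12)

# table() counts occurrences, so summing it adds up frequencies
counts <- table(number)[as.numeric(names(table(number))) <= 10]
sum(counts)   # how many entries are <= 10, repeats included

# unique() keeps the distinct values themselves
seen <- unique(number[number <= 10])

# values in 1:10 that never occur in the vector, and their sum
missing <- setdiff(1:10, seen)
sum(missing)
```

Here sum(counts) answers "how many entries are at most 10", while sum(missing) answers "what do the absent values add up to", which is the quantity the question is after.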

Related

Is there an R function equivalent to Excel's $ for "keep reference cell constant" [duplicate]

This question already has answers here:
Divide each data frame row by vector in R
(5 answers)
Closed 2 years ago.
I'm new to R and I've done my best googling for the answer to the question below, but nothing has come up so far.
In Excel you can keep a specific column or row constant when using a reference by putting $ before the row number or column letter. This is handy when performing operations across many cells when all cells are referring to something in a single other cell. For example, take a dataset with grades in a course: Row 1 has the total number of points per class assignment (each column is an assignment), and Rows 2:31 are the raw scores for each of 30 students. In Excel, to calculate percentage correct, I take each student's score for that assignment and refer it to the first row, holding row constant in the reference so I can drag down and apply that operation to all 30 rows below Row 1. Most importantly, in Excel I can also drag right to do this across all columns, without having to type a new operation.
What is the most efficient way to perform this operation--holding a reference row constant while performing an operation to all other rows, then applying this across columns while still holding the reference row constant--in R? So far I had to slice the reference row to a new dataframe, remove that row from the original dataframe, then type one operation per column while manually going back to the new dataframe to look up the reference number to apply for that column's operation. See my super-tedious code below.
For reference, each column is an assignment, and Row 1 had the number of points possible for that assignment. All subsequent rows were individual students and their grades.
# Extract number of points possible
outof <- slice(grades, 1)
# Now remove that row (Row 1)
grades <- grades[-c(1),]
# Turn number correct into percentage. The divided by
# number is from the sliced Row 1, which I had to
# look up and type one-by-one. I'm hoping there is
# code to do this automatically in R.
grades$ExamFinal <- (grades$ExamFinal / 34) * 100
grades$Exam3 <- (grades$Exam3 / 26) * 100
grades$Exam4 <- (grades$Exam4 / 31) * 100
grades$q1.1 <- grades$q1.1 / 6
grades$q1.2 <- grades$q1.2 / 10
grades$q1.3 <- grades$q1.3 / 6
grades$q2.2 <- grades$q2.2 / 3
grades$q2.4 <- grades$q2.4 / 12
grades$q3.1 <- grades$q3.1 / 9
grades$q3.2 <- grades$q3.2 / 8
grades$q3.3 <- grades$q3.3 / 12
grades$q4.1 <- grades$q4.1 / 13
grades$q4.2 <- grades$q4.2 / 5
grades$q6.1 <- grades$q6.1 / 5
grades$q6.2 <- grades$q6.2 / 6
grades$q6.3 <- grades$q6.3 / 11
grades$q7.1 <- grades$q7.1 / 7
grades$q7.2 <- grades$q7.2 / 8
grades$q8.1 <- grades$q8.1 / 7
grades$q8.3 <- grades$q8.3 / 13
grades$q9.2 <- grades$q9.2 / 13
grades$q10.1 <- grades$q10.1 / 8
grades$q12.1 <- grades$q12.1 / 12
You can use sweep
100*sweep(grades, 2, outof, "/")
# ExamFinal EXam3 EXam4
#1 100.00 76.92 32.26
#2 88.24 84.62 64.52
#3 29.41 100.00 96.77
Data:
grades
ExamFinal EXam3 EXam4
1 34 20 10
2 30 22 20
3 10 26 30
outof
[1] 34 26 31
grades <- data.frame(ExamFinal=c(34,30,10),
EXam3=c(20,22,26),
EXam4=c(10,20,30))
outof <- c(34,26,31)
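If the points possible are still sitting in the first row of the data frame, as in the question, one possible way to get the Excel-style "fixed reference row" is sketched below. The data are made up, and using unlist() to turn the reference row into a plain vector is my own assumption about how to feed it to sweep.

```r
# Made-up scores; row 1 holds the points possible for each assignment
grades <- data.frame(ExamFinal = c(34, 34, 30, 10),
                     Exam3     = c(26, 20, 22, 26))

# Pull the reference row out as a plain numeric vector
outof <- unlist(grades[1, ])

# Divide every student row, column by column, by the matching entry of outof
pct <- 100 * sweep(grades[-1, ], 2, outof, "/")
round(pct, 2)
```

The MARGIN argument of 2 tells sweep to line up outof with the columns, so each column is divided by its own total without typing the numbers by hand.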
You can use mapply on the original grades dataframe (don't remove the first row) to divide rows by the first row. Then convert the result back to a dataframe.
as.data.frame(mapply("/", grades[2:31, ], grades[1, ]))
The easiest way is to use some type of loop. In this case I am using the sapply function to divide all of the elements in each column by the corresponding total score.
#Example data
outof<-data.frame(q1=c(3), q2=c(5))
grades<-data.frame(q1=c(1,2,3), q2=c(4,4, 5))
answermatrix <-sapply(1:ncol(grades), function(i) {
#grades[,i]/outof[i] #use this if "outof" is a vector
grades[,i]/outof[ ,i]
})
answermatrix
A loop would probably be your best bet.
First, for each column, extract the maximum number of points possible, as listed in the first row; then use that number to calculate the percentage for the remaining rows of that column:
for (i in 1:ncol(df)) {
a <- df[1, i] #this pulls the total points for column i into a
j <- 2 #start again at the first student row for each column
#then we compute using that number
while (j <= nrow(df)) { #row 1 holds the totals, so students are rows 2..nrow(df)
b <- df[j, i] #gets the score in this row and column, i.e. one student's score
df[j, i] <- ((b / a) * 100) #replaces that row,column with the percentage
j <- j + 1 #goes to next row
}
}
The only drawback to this approach is that a data frame modified inside a function is not copied to the global environment, but that can be handled by writing the function like so:
f1 <- function(x, y) {
#x: the data frame; y: the (unquoted) name you want the completed df to be called
for (i in 1:ncol(x)) {
a <- x[1, i] #total points for column i
j <- 2 #first student row
while (j <= nrow(x)) {
b <- x[j, i]
x[j, i] <- ((b / a) * 100)
j <- j + 1
}
}
arg_name <- deparse(substitute(y)) #gets argument name
var_name <- paste(arg_name) #construct the name
assign(var_name, x, envir = .GlobalEnv) #produces global dataframe
}

R function to subset dataframe so that non-adjacent values in a column differ by >= X (starting with the first value)

I am looking for a function that iterates through the rows of a given column ("pos" for position, ascending) in a dataframe and keeps only those rows whose values are at least, let's say, 10 apart, starting with the first row. Thus it would start with the first row (and store it), then carry on until it finds a row with a value at least 10 higher than the first, store this row, and then start again from this value, looking for the next row at least 10 higher.
So far I have an R for loop that successfully finds adjacent rows at least X values apart, but it does not have the capability of looking any further than one row down, nor of stopping once it has found the given row and starting again from there.
Here is the function I have:
# example data frame
df <- data.frame(x=c(1:1000), pos=sort(sample(1:10000, 1000)))
# prep function (this only checks row above)
library(dplyr)
pos.apart.subset <- function(df, pos.diff) {
# create new dfs to store output
new.df <- list()
new.df1 <- data.frame()
# iterate through each row of df
for (i in 1:nrow(df)) {
# if the value of next row is higher or equal than value or row i+posdiff, keep
# if not ascending, keep
# if first row, keep
if(isTRUE(df$pos[i+1] >= df$pos[i]+pos.diff | df$pos[i+1] < df$pos[i] | i==1 )) {
# add rows that meet conditions to list
new.df[[i]] <- df[i,] }
}
# bind all rows that met conditions
new.df1 <- bind_rows(new.df)
return(new.df1)}
# test run for pos column adjacent values to be at least 10 apart
df1 <- pos.apart.subset(df, 10); head(df1)
Happy to do this in awk or any other language. Many thanks.
It seems I misunderstood the question earlier. Since we don't want to calculate the difference between consecutive rows, you can try:
nrows <- 1
previous_match <- 1
for(i in 2:nrow(df)) {
if(df$pos[i] - df$pos[previous_match] > 10) {
nrows <- c(nrows, i)
previous_match <- i
}
}
and then subset the selected rows :
df[nrows, ]
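The same idea can be wrapped in a small reusable function. This is only a sketch: the name keep_spaced and its min_gap argument are hypothetical, and it uses >= (as in the question title) where the loop above uses a strict > 10.

```r
# Hypothetical helper: TRUE for rows to keep, so that each kept value of
# `pos` is at least `min_gap` above the previously kept one
keep_spaced <- function(pos, min_gap) {
  keep <- logical(length(pos))
  if (length(pos) == 0) return(keep)
  keep[1] <- TRUE          # always keep the first row
  last <- pos[1]           # value of the most recently kept row
  for (i in seq_along(pos)[-1]) {
    if (pos[i] - last >= min_gap) {
      keep[i] <- TRUE
      last <- pos[i]       # restart the comparison from this row
    }
  }
  keep
}

# Made-up ascending positions
toy <- data.frame(x = 1:7, pos = c(1, 5, 11, 12, 20, 21, 35))
toy[keep_spaced(toy$pos, 10), ]
```

On this toy data the kept positions are 1, 11, 21 and 35: each one is compared with the last kept value, not with the row directly above it.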
Earlier answer
We can use diff to get the difference between consecutive rows and select the rows where that difference is greater than 10.
head(subset(df, c(TRUE, diff(pos) > 10)))
# x pos
#1 1 1
#2 2 31
#6 6 71
#9 9 134
#10 10 151
#13 13 185
The leading TRUE selects the first row by default.
In dplyr, we can use lag to get value from previous row :
library(dplyr)
df %>% filter(pos - lag(pos, default = -Inf) > 10)

Get ranges of dataframe given an input value (i.e. 1 returns df[1:10,])

I have a dataframe df, and I want to create a function in R that returns ranges of 10 entries of the dataframe given an input number. That is:
If input number is equal to 1, the function returns df[1:10,]
If input number is equal to 2, the function returns df[11:20,]
If input number is equal to 3, the function returns df[21:30,]
...
Like they were pages: page 1 shows ten entries, page 2 shows next ten entries, and so on.
Note:
If there are no more "ten entries" to return, the function should return whatever is left in the dataframe.
The length of the dataframe is not fixed (i.e. the function takes the df to use and the "page" to return).
It looks pretty simple to implement but I cannot figure out how to do it in a proper and fast way.
Edit
I meant returning the rows, not columns, sorry. Just edited. But @Freakazoid's solution more or less does the trick, just replacing ncol with nrow (see his solution below).
The following function does the trick:
df <- data.frame(matrix(rnorm(54 * 3), nrow=54, ncol=3))
batch_df <- function(df, batch_part) {
nbr_row <- nrow(df)
batch_size <- 10
nbr_of_batchs <- as.integer(nbr_row/batch_size)
last_batch_size <- (nbr_row - nbr_of_batchs*batch_size)
batch_indizes <- c(rep(1:nbr_of_batchs, each=batch_size),
rep(nbr_of_batchs+1, last_batch_size))
if(all(batch_part %in% batch_indizes)) {
row_index <- which(batch_indizes %in% c(batch_part))
ret_df <- df[ row_index,]
} else {
ret_df <- data.frame()
}
return(ret_df)
}
batch_df(df, 3)
The function first defines a batch index for each row. With these indices, the function looks up the batch_part you want to select.
The function is not limited to a single number; batch_part can also be a vector, so you can select multiple batch parts at once.
Output:
X1 X2 X3
21 0.7168950 0.88057886 0.1659177
22 -1.0560819 -0.53230247 -0.4204708
23 0.4835649 -1.43453719 0.1563253
24 0.1266011 1.22149179 -0.7924120
25 0.3982262 -0.59821992 -1.1645105
26 -0.4809448 0.42533877 0.2359328
27 -0.1530060 -0.23762552 0.9832919
28 0.8808083 -0.06004995 -1.0810818
29 -0.2924377 -1.23812802 -0.9057353
30 -0.2420152 -0.52037258 0.7406486
Given input number i, try
j <- i * 10
max <- pmin(j, nrow(df))
df[(j-9):max, ]
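The snippet above could be packaged as a function along these lines. This is a sketch, not a definitive version: get_page and page_size are names I made up, and returning a zero-row data frame for a page past the end is my own choice.

```r
# Hypothetical helper: return page number `page` of `df`, `page_size` rows per page
get_page <- function(df, page, page_size = 10) {
  first <- (page - 1) * page_size + 1
  # A page past the end yields an empty data frame with the same columns
  if (first > nrow(df)) return(df[0, , drop = FALSE])
  last <- min(page * page_size, nrow(df))   # clip the final, shorter page
  df[first:last, , drop = FALSE]
}

# Made-up data with 23 rows: pages 1 and 2 are full, page 3 has 3 rows
toy <- data.frame(id = 1:23, value = (1:23)^2)
get_page(toy, 3)
```

Checking first against nrow(df) before building the range avoids a reversed sequence such as 31:23 when the requested page does not exist.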

Creating function to read data set and columns and displaying nrow

I am struggling a bit with a probably fairly simple task. I wanted to create a function whose arguments are a dataframe (df), column names of the dataframe (T and R), and values of the selected columns (a and b). I know that the function reads the dataframe, but I don't know how the columns are selected. I'm getting an error.
fun <- function(df,T,a,R,b)
{
col <- ds[c("x","y")]
omit <- na.omit(col)
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
nrow(data2)/nrow(data1)
}
fun(jugs,Place,UK,Price,10)
I'm new to the R language, so please help me.
There are several errors you're making.
col <- ds[c("x","y")]
What are x and y? Presumably they're arguments that you're passing, but you specify T and R in your function, not x and y.
data1 <- omit[omit$x == 'a',]
data2 <- omit[omit$x == 'b',]
Again, presumably, you want a and b to be arguments you passed to the function, but you specified 'a' and 'b' which are specific, not general arguments. Also, I assume that second "omit$x" should be "omit$y" (or vice versa). And actually, since you just made this into a new data frame with two columns, you can just use the column index.
nrow(data2)/nrow(data1)
You should print this line, or return it. Either one should suffice.
fun(jugs,Place,UK,Price,10)
Finally, you should use quotes on Place, UK, and Price, at least the way I've done it.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
data1 <- omit[omit[,1] == val1,]
data2 <- omit[omit[,2] == val2,]
print(nrow(data2)/nrow(data1))
}
fun(jugs, "Place", "UK", "Price", 10)
And if I understand what you're trying to do, it may be easier to avoid creating multiple dataframes that you don't need and just use counts instead.
fun <- function(df, col1, val1, col2, val2){
new_cols <- df[,c(col1, col2)]
omit <- na.omit(new_cols)
n1 <- sum(omit[,1] == val1)
n2 <- sum(omit[,2] == val2)
print(n2/n1)
}
fun(jugs, "Place", "UK", "Price", 10)
I would write this function as follows:
fun <- function(df,T,a,R,b) {
data <- na.omit(df[c(T,R)]);
sum(data[[R]]==b)/sum(data[[T]]==a);
};
As you can see, you can combine the first two lines into one, because in your code col was not reused anywhere. Secondly, since you only care about the number of rows of the two subsets of the intermediate data.frame, you don't actually need to construct those two data.frames; instead, you can just compute the logical vectors that result from the two comparisons, and then call sum() on those logical vectors, which naturally treats FALSE as 0 and TRUE as 1.
Demo:
fun <- function(df,T,a,R,b) { data <- na.omit(df[c(T,R)]); sum(data[[R]]==b)/sum(data[[T]]==a); };
df <- data.frame(place=c(rep(c('p1','p2'),each=4),NA,NA), price=c(10,10,20,NA,20,20,20,NA,20,20), stringsAsFactors=F );
df;
## place price
## 1 p1 10
## 2 p1 10
## 3 p1 20
## 4 p1 NA
## 5 p2 20
## 6 p2 20
## 7 p2 20
## 8 p2 NA
## 9 <NA> 20
## 10 <NA> 20
fun(df,'place','p1','price',20);
## [1] 1.333333

Better way to improve the for loop for my case in R?

Problem statement:
The data set has the columns mstr_program_list and loc_cat, with about 600,000 rows. The loc_cat column holds both missing and non-missing cells; the other columns have no NAs. For each program in mstr_program_list, I need to find the total number of loc_cat rows associated with that program, the percentage of non-missing rows, and, among the non-missing rows, the number of categories they are divided into.
Example: for the Unknown program, the total number of rows is 3 and two of them are non-missing in loc_cat, so the percentage is (2/3)*100, and the number of categories is two (Rests: Full and Rests: lim).
> head(data)
L.Name mstr_program_list loc_cat
1 Six J'SGroup Unknown <NA>
2 Bj's- Maine Roasted Tomat Rests: Full
3 Bj's- Maine Unknown Rests: Full
4 Brad's Q Q Unknown Rests: lim
expected output:
mstr_prog total_count %good(non missing rows) Number of loc_cat
Unknown 3 66.7 2
The code below is taking a lot of time; in fact, no results are showing. Can anyone help me improve the code? As I see it, the problem with this code is how it adds to the vectors.
From my research, I learned that to add values to a vector I should use c() rather than append():
v <- c(v, 'y') # adding elements into a vector
Code:
data <- read.csv("MgData.csv",header=T, na.strings="", colClasses = classes, nrows = 600338,comment.char="") ## import data.
data_NoNull <- na.omit(data)
mpl_unique <- unique(data$mstr_program_list)
mas_Prog_List <- as.character()
loc_Count <- as.numeric()
per_Seg <- as.numeric()
num_Seg <- as.numeric()
for(i in 1:length(mpl_unique)) {
prog <- mpl_unique[i] # current program name
l_t <- length(data$mstr_program_list[data$mstr_program_list == prog]) # loc_cat specific to prog
l_g <- length(data_NoNull$mstr_program_list[data_NoNull$mstr_program_list == prog]) ## to know filled ones excluding empty
s <- subset(data_NoNull, mstr_program_list==prog, select =c(loc_cat))
if((any(prog == mas_Prog_List)) == FALSE) {
no_Seg <- nrow(unique(s))
mas_Prog_List <- c(mas_Prog_List, as.character(prog)) # Adding values to vector
loc_Count <- c(loc_Count, l_t)
perct_Seg <- ((l_g/l_t)*100)
per_Seg <- c(per_Seg, perct_Seg)
num_Seg <- c(num_Seg, no_Seg)
}
}
Seg_analysis <- data.frame(mas_Prog_List, loc_Count, per_Seg, num_Seg)
I am new to R. Please correct me on the code and on the naming conventions or terminology used.
Thanks
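For reference, the per-program summary described in the problem statement can be sketched without an explicit loop by using tapply from base R. The four rows below are made up to mirror the head(data) excerpt, and the output column names (mstr_prog, total_count, pct_good, n_loc_cat) are placeholders of my own, so treat this as an illustration of the computation, not a tested solution for the full file.

```r
# Made-up rows mirroring the excerpt above (NA marks a missing loc_cat)
data <- data.frame(
  mstr_program_list = c("Unknown", "Roasted Tomat", "Unknown", "Unknown"),
  loc_cat           = c(NA, "Rests: Full", "Rests: Full", "Rests: lim"),
  stringsAsFactors  = FALSE
)

grp <- data$mstr_program_list

# One pass per statistic, grouped by program
total_count <- tapply(data$loc_cat, grp, length)
pct_good    <- tapply(data$loc_cat, grp, function(x) 100 * mean(!is.na(x)))
n_loc_cat   <- tapply(data$loc_cat, grp, function(x) length(unique(x[!is.na(x)])))

seg_analysis <- data.frame(
  mstr_prog   = names(total_count),
  total_count = as.vector(total_count),
  pct_good    = round(as.vector(pct_good), 1),
  n_loc_cat   = as.vector(n_loc_cat)
)
seg_analysis
```

Each tapply call groups loc_cat by program once, so no vector has to be grown inside a loop with c().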

Resources