Trying to find mean of each column in a data set - r

Hello everyone, I am fairly new to R programming and was wondering if someone could help me out. I was playing with R and wanted to write a function that returns a vector of the means of each column in a data set that the user passes in as an argument. I am trying to do it without mean or the apply functions, so I am working it out manually and feel I am very close to finishing it. Could someone check it to see where I made an error?
Here is my code:
findMeans<- function(data)
{
  meanVec <- numeric()
  for(i in 1:6)
  {
    mean=0
    for( j in 1:153)
    {
      value=0
      count=0
      if(is.na(data[j,i])==FALSE)
      {
        value= value + data[i,j]
        count=count+1
      }
      else
      {
        value= value +0
      }
    }
    mean =value/count
    meanVec[i]<-mean
  }
  meanVec
}
When I try to list the vector, it just gives this:
> meanVec
numeric(0)
Could anyone shed some light on what I am doing wrong?

If you're looking for function-writing practice and are already aware of the colMeans function, there are a couple of errors I spotted.
1) I assume that 1:6 loops over the columns of your data frame and 1:153 loops over the rows. If so, the value=0 and count=0 statements should be moved up one level, next to mean=0. Otherwise you reset both to zero on every row, so the function only ever reports the last value it comes across.
2) In the line value= value + data[i,j], you need data[j,i] instead. You reversed the row and column indices.
With those two changes, your function seems to work for a data set with 6 columns and 153 rows. For more practice, I'd recommend generalizing the function to any number of columns and rows.
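As a rough sketch of that generalization (not the only way to do it), the version below applies both fixes and takes the loop bounds from ncol() and nrow() instead of hard-coding 6 and 153. The small example data frame is made up for illustration. Note also that the numeric(0) output is presumably unrelated to the two bugs: meanVec inside the function is local, so the result has to be captured from the function's return value.

```r
# Column means without mean() or the apply family; NA values are skipped.
findMeans <- function(data) {
  meanVec <- numeric(ncol(data))
  for (i in seq_len(ncol(data))) {      # i indexes columns
    value <- 0                          # reset once per column, not per row
    count <- 0
    for (j in seq_len(nrow(data))) {    # j indexes rows
      if (!is.na(data[j, i])) {
        value <- value + data[j, i]     # row first, then column
        count <- count + 1
      }
    }
    meanVec[i] <- value / count
  }
  meanVec
}

# Hypothetical example data, only for illustration
example <- data.frame(a = c(1, 2, NA), b = c(4, 6, 8))
result <- findMeans(example)   # store the returned vector
print(result)
```

The call has to be assigned (result <- findMeans(example)) or printed directly; typing meanVec at the console looks up a different, global object.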

Related

Mean value for different groups

I am stuck with a 'for' loop and would greatly appreciate some help.
I have a data frame called 'df' with the number of people per household (household_size), ranging from 0 (I replaced the missing values with 0) to 8, as well as the number of cars (car).
My aim is to write some quick code that computes the average number of cars for each household size.
I tried the following:
avg <- function(df){
i <- df$household_size
for (i in 0 : 8){
print(mean(df$car))
}
}
I'm pretty sure I'm missing something really basic here, but I don't know what.
Thanks everyone for your input.
I wouldn't have used a function for this. However, this is an exercise as part of an introductory coding with R module that specifically requires a for-loop.
Here is a solution that prints the mean for each size group using a for loop. Let me know if it works.
for(i in unique(df$household_size)){
  # subset the car column by name; a bare `car` is not found in a plain data frame
  print(paste(i,' : ',mean(df$car[df$household_size%in%i])))
}
As mentioned in a comment, I dropped the function because I don't see the point of having it. But if it's mandatory, you can use lapply, which in my view behaves a bit like a for loop:
lapply(unique(df$household_size), function(i){
  # same subsetting as in the for loop version
  return(paste(i,' : ',mean(df$car[df$household_size%in%i])))
}
)
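Since the exercise apparently requires both a function and a for loop, here is a minimal sketch that combines the two and returns the means instead of printing them. The function name avg_cars_by_size and the example data are made up for illustration; only household_size and car come from the question.

```r
# Average number of cars per household size, using an explicit for loop.
avg_cars_by_size <- function(df) {
  sizes <- sort(unique(df$household_size))
  result <- numeric(length(sizes))
  names(result) <- sizes
  for (k in seq_along(sizes)) {
    # keep only the rows whose household size matches the current group
    result[k] <- mean(df$car[df$household_size == sizes[k]])
  }
  result
}

# Hypothetical example data
df <- data.frame(household_size = c(1, 1, 2, 2, 2),
                 car            = c(0, 2, 1, 1, 4))
avgs <- avg_cars_by_size(df)
print(avgs)
```

The key difference from the original attempt is that mean() is applied to a subset of df$car for each group, rather than to the whole column on every iteration.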

How to refer to previous row in data table (R) without using shift function?

I am trying to use data.table to make my massive modeling work easier.
I ran into the following problem and have not found a good way to solve it.
Here's my example:
library(data.table)
set.seed(1001)
dt1<-data.table("Country"=c("Algeria","Mongolia"),
                "Year"=c(2000:2020),
                "var1"=runif(20,10000,45000),
                "var2"=0,
                "var3"=0,
                "var4"=0)
setorder(dt1,Country)
I would like to update var2 using the previous row's values within a specific period,
so I tried
dt1[Year>2000&Year<2010,var2:=var1[i]-sum(.SD)[i-1],by=Country,.SDcols=c(var2,var3,var4)]
Obviously this did not work. The problem is that var2 and the sum of the other variables in the previous row need to be updated simultaneously, so I don't think the shift function would do the job.
For now I am using a very cumbersome for loop, but it works. Here is my for loop:
for(i in 1:nrow(dt1)){
  if (i <=10){
    for (j in 4:6){
      if(j==4){
        dt1[[i,j]]=dt1[[i,3]]-rowSums(dt1[[i-1,c(4:6)]])
      }
      else{
        dt1[[i,j]]=dt1[[i-1,j-1]] * 0.001
      }
    }
  }
}
Any suggestion will be much appreciated!
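One possible direction, offered as a sketch rather than a tested answer: because each row depends on values computed for the row before it, some sequential pass is hard to avoid, but it can run on plain vectors inside a small helper. The helper name fill_recursive, the rows argument, and the three-row example are assumptions made for illustration; the update rules are read off the for loop above (var2 from var1 minus the previous row's sum, var3 and var4 from the previous row's var2 and var3 times 0.001).

```r
# Sequentially fill var2, var3, var4 for the given row positions.
# Each row uses the already-updated values of the previous row.
fill_recursive <- function(var1, var2, var3, var4, rows) {
  for (i in rows[rows > 1]) {           # row 1 has no previous row
    prev_sum <- var2[i - 1] + var3[i - 1] + var4[i - 1]
    new_var3 <- var2[i - 1] * 0.001
    new_var4 <- var3[i - 1] * 0.001
    var2[i] <- var1[i] - prev_sum
    var3[i] <- new_var3
    var4[i] <- new_var4
  }
  list(var2 = var2, var3 = var3, var4 = var4)
}

# Hypothetical single-country example
one_country <- data.frame(Year = 2000:2002, var1 = c(10, 20, 30),
                          var2 = 0, var3 = 0, var4 = 0)
rows <- which(one_country$Year > 2000 & one_country$Year < 2010)
out <- fill_recursive(one_country$var1, one_country$var2,
                      one_country$var3, one_country$var4, rows)
print(out)
```

With data.table, the same helper could presumably be applied per group with something like dt1[, c("var2","var3","var4") := fill_recursive(var1, var2, var3, var4, which(Year>2000 & Year<2010)), by=Country], though that call is untested here.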

Cannot figure out how to use IF statement

I want to create a categorical variable for my DB: a "Same_Region" group that includes all the people who live and work in the same region, and a "Diff_Region" group for those who don't. I tried to use an if statement, but I don't know how to properly say "if the variables Region of residence and Region of work are the same, return...". It's the very first time I have tried to approach R by myself, and I feel a little bit lost.
I tried storing the two variables (two-letter codes, e.g. "BO") as characters and using the "grep" command, but that led to no results.
Then I tried storing both variables as factors, and nothing much changed.
----In R-----
extractSamepr <- function(RegionOfRes, RegionOfWo){
if(RegionOfRes== RegionOfWo){
return("SamePr")
}
else {
return("DiffPr")
}
SamePr <- NULL
for (i in 1:nrow(Data.Base)) {
SamePr <- c(SamePr, extractSamepr(Data.Base[i, "RegionOfRes", "RegionOfWo"]))
}
The ifelse way proposed in @deepseefan's comment is a standard way of solving this type of problem.
Here is another one. It uses the fact that FALSE/TRUE are coded as integers 0/1: build a logical vector from the equality test and add 1 to it, which gives a vector of 1/2 values. That result is used in the function's final instruction to index a vector holding the two possible outcomes.
extractSamepr <- function(DF){
  i <- 1 + (DF[["RegionOfRes"]] == DF[["RegionOfWo"]])
  c("DiffPr", "SamePr")[i]
}
Data.Base$SamePr <- extractSamepr(Data.Base)
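For completeness, the ifelse approach mentioned above might look like the sketch below. The three-row Data.Base is made up for illustration, and the as.character() calls are an added precaution: comparing two factors with different level sets raises an error, so converting first keeps the comparison safe whether the columns are characters or factors.

```r
# Hypothetical example data with two-letter region codes
Data.Base <- data.frame(RegionOfRes = c("BO", "MI", "RM"),
                        RegionOfWo  = c("BO", "TO", "RM"),
                        stringsAsFactors = FALSE)

# ifelse is vectorized: it compares the two columns element by element,
# so no explicit loop over rows is needed.
Data.Base$SamePr <- ifelse(
  as.character(Data.Base$RegionOfRes) == as.character(Data.Base$RegionOfWo),
  "SamePr", "DiffPr"
)
print(Data.Base)
```

A plain if only looks at a single TRUE/FALSE value, which is why the row-by-row attempt in the question needed a loop.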

Multiple regressions with loop in loop in R

I want to run the following regressions. The problematic variable is EP, a dummy variable for which I must check different cases; z (length = 1000) is the threshold variable. I want to create 1000 different versions of EP from z and save the coefficients. I use a loop within a loop, but the results are completely wrong. The code runs without raising an error, and what I ran is exactly what is shown below, including the square brackets and parentheses. The problem is that it is extremely slow: after two hours it is still running.
I reduced the sample by 99% and again got no result, although the code ran without a problem.
I do not want anything special, just to run a different regression for each value of z and store the estimates. I cannot understand why it takes so long. Any ideas?
for (k in 1:1000){
z<-u[k]
for (i in 1:length(dS)){
if (dS[i]>=z) {
EP[i]=1
} else {
EP[i]=0
}
fitT <- dynlm(dR ~ L(dR,1)+L(EN)+L(EP)+L(ΚΜ,1)
prob[[k]] <- summary(fitT)$coefficients[1, 2]
}
You don't have a closing } for the i-loop; you also don't have a closing ) for dynlm.
Note that you can replace your entire i-loop with
EP <- as.integer(dS >= z)
Next time you ask a question, be clear and specific. What do you mean by "I use a loop in loop but the results are completely wrong"? Is there an error message, for example?
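Putting those two points together, the overall structure could look like the sketch below. To keep it self-contained it swaps dynlm for base lm with a manually lagged EP, uses simulated data, and only 9 thresholds; dS, dR and u are names from the question, while the simulated values and the est vector are assumptions for illustration.

```r
set.seed(1)
n  <- 200
dS <- rnorm(n)                                        # simulated threshold series
dR <- rnorm(n)                                        # simulated response
u  <- quantile(dS, probs = seq(0.1, 0.9, by = 0.1))   # simulated thresholds

est <- numeric(length(u))          # preallocate storage for the estimates
for (k in seq_along(u)) {
  z  <- u[k]
  EP <- as.integer(dS >= z)        # vectorized dummy, no inner loop needed
  # regress dR on the one-period lag of EP (base lm stands in for dynlm here)
  fit <- lm(dR[-1] ~ EP[-n])
  est[k] <- coef(summary(fit))[2, 1]   # slope estimate on lagged EP
}
print(est)
```

Each threshold now triggers exactly one model fit. In the code as posted, the fit sits inside the i-loop, so the model would be refitted once per observation for every threshold, which may explain the running time.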

store results of for loop in unique objects

Here is a simple loop
for (i in seq(1,30)) {
mdl<-i
}
How do I get 30 mdl objects rather than just one mdl (which happens because, within the loop, mdl is overwritten at every iteration)? What I want is 30 objects, perhaps with names like mdl1, mdl2, ..., mdl30.
I tried this:
for (i in seq(1,30)) {
mdli<-i
}
But if I type mdl1, it says mdl1 not found whereas typing mdli gives me the value of i=5
Thank you
You can declare your storage variable beforehand without specifying how many values it will hold. If you want a separate variable for each value, take a look at the paste function.
x<- NULL
for (i in 1:10){
  x[i] <- i*2
}
*edit: The comment above is right. This way is not the most efficient one, but I still use it when computation time is not an issue.
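To make the paste hint concrete, here are two sketches. The first stores everything in one named list, which is usually easier to work with afterwards; the second creates 30 separate objects by combining paste0 with assign (paste alone only builds the name, it does not create a variable). The names follow the mdl1 ... mdl30 pattern from the question.

```r
# Option 1: one named list holding all 30 results
mdl <- vector("list", 30)              # preallocate
for (i in seq_len(30)) {
  mdl[[i]] <- i
}
names(mdl) <- paste0("mdl", seq_len(30))
print(mdl[["mdl5"]])

# Option 2: 30 separate objects named mdl1, mdl2, ..., mdl30
for (i in seq_len(30)) {
  assign(paste0("mdl", i), i)          # build the name, then bind the value
}
print(mdl30)
```

Writing mdli inside the loop does not substitute the value of i into the name; R treats mdli as a single, literal variable name, which is why mdl1 was not found.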
