I'm new to programming in R and I'm working with a huge dataset containing hundreds of variables and thousands of observations. One of these variables is Age, which is my main concern. I want to get the mean of every other variable as a function of Age. I can get smaller tables with this:
for(i in 18:84)
{
n<- sprintf("SortAgeM%d",i)
assign(x=n,subset(SortAgeM,subset=(SortAgeM$AGE>=i & SortAgeM$AGE<i+1)))
}
"SortAgeM85plus"<-subset(SortAgeM,subset=(SortAgeM$AGE>=85 & SortAgeM$AGE<100))
This gives me a sub-dataset for each age I'm concerned with. I would then like to get the mean of each column. Each column holds the volume of a specific brain region. I'm interested in how the volume decreases with time, and I would like to be able to tell whether individuals of a given age are close to the mean for their age or not.
Now, I would like to get one more row with the mean for each column. So I tried this:
for(i in 18:85) {
addmargins((SortAgeM%d,i), margin=1, FUN= "mean")
}
But it didn't work... I'm stuck, and I'm not familiar enough with R functions to find a solution on the net...
Thank you for your help.
Victor
Post answer edit: This is what I finally did:
for(i in 18:84)
{
n<- sprintf("SortAgeM%d",i)
# keep the subset in item so the following lines can use it
item<-subset(SortAgeM,subset=(SortAgeM$AGE>=i & SortAgeM$AGE<i+1))
Ajustment<-c(NA,NA,NA,NA,NA,NA,NA) # the first 7 variables aren't numeric
Line1<- colMeans(item[,8:217],na.rm=TRUE)
Line<-c(Ajustment,Line1)
assign(x=n, rbind(item,Line))
}
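The per-age loop above can also be written without creating one data frame per age. The following is only a sketch on a toy stand-in for SortAgeM: the ID, vol1 and vol2 columns are hypothetical, and ages are binned with floor() to mimic the AGE>=i & AGE<i+1 condition. It uses aggregate() for the per-age means and ave() to measure how far each individual is from the mean of their age group.

```r
# Toy stand-in for SortAgeM (ID, vol1 and vol2 are hypothetical column names)
SortAgeM <- data.frame(
  ID   = c("a", "b", "c", "d"),
  AGE  = c(18.2, 18.9, 19.5, 19.1),
  vol1 = c(10, 12, 8, 6),
  vol2 = c(1, 3, 5, 7)
)

num_cols <- c("vol1", "vol2")      # the numeric columns to average
age_bin  <- floor(SortAgeM$AGE)    # 18.2 and 18.9 both fall in the age-18 group

# One row per age, one mean per numeric column
age_means <- aggregate(SortAgeM[num_cols], by = list(AGE = age_bin),
                       FUN = mean, na.rm = TRUE)

# Deviation of each individual from the mean of their own age group
vol1_dev <- SortAgeM$vol1 - ave(SortAgeM$vol1, age_bin)
```

With this layout, age_means can be merged back onto the original data by age if the group means are needed next to each observation.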
If you simply want an additional row with the mean of each column, you can rbind the colMeans of your df like this (colMeans requires every column of df to be numeric):
df_new <- rbind(df, colMeans(df))
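When some columns are not numeric, as in the question, one possible variation is to compute the means only over the numeric columns and leave NA elsewhere. This is a sketch on a made-up data frame; the column names id, x and y are assumptions for illustration only.

```r
# Made-up data: one non-numeric column and two numeric ones
df <- data.frame(
  id = c("a", "b", "c"),
  x  = c(1, 2, 3),
  y  = c(10, NA, 30)
)

is_num <- sapply(df, is.numeric)   # TRUE for the columns colMeans() can handle

# Start from a copy of the first row so the column types are preserved
mean_row <- df[1, ]
mean_row[1, !is_num] <- NA
mean_row[1, is_num]  <- colMeans(df[is_num], na.rm = TRUE)

df_new <- rbind(df, mean_row)
```

The last row of df_new then holds the column means, with NA in the non-numeric columns.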
For a course at university where we learn R, we have to filter a supplied dataframe (called crimes). The original dataframe has 8 columns.
I do not think I can supply the data set, since it is part of an assignment for school. But any advice would be really appreciated.
The task requires using a loop and an if-statement to filter one column ("category") and keep only the rows with one specific level (out of 14), named "drugs", and then putting only three of the eight columns of those rows into a new dataframe.
for (i in crimes$category) {
if (i == "drugs") {
drugs <- rbind(drugs, crimes[c(2,3,7)])
}
}
Now I know the problem is in the rbind call, since it just duplicates all rows 160 times (there are 160 rows with the category "drugs"). But I do not know how to get a dataframe with 160 observations and only 3 variables.
Note that the assignment defeats the purpose of using R. That said, use the for / if construct to collect the row numbers whose category value is "drugs", then create the result df outside the loop:
keep <- integer()
for (i in seq_along(crimes$category)) {
if (crimes$category[i] == "drugs") {
keep <- c(keep, i) # store the row number, not the category value
}
}
crimes2 <- crimes[keep, c(2,3,7)]
Note the base R no-loop solution would be:
crimes2 <- crimes[crimes$category == "drugs", c(2,3,7)]
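Since the original data set cannot be shared, here is a sketch of both approaches on a made-up crimes data frame. The column names month, street and outcome are hypothetical, and columns are selected by name rather than by position. The toy data includes an NA category to show one way of guarding against missing values, which is an addition to the answer above.

```r
# Made-up stand-in for the crimes data (column names are hypothetical)
crimes <- data.frame(
  month    = c("2015-01", "2015-01", "2015-02", "2015-02"),
  street   = c("A St", "B St", "C St", "D St"),
  category = c("drugs", "burglary", "drugs", NA),
  outcome  = c("w", "x", "y", "z")
)
wanted <- c("month", "street", "outcome")

# Loop + if version: collect matching row numbers, subset once at the end
keep <- integer()
for (i in seq_len(nrow(crimes))) {
  if (!is.na(crimes$category[i]) && crimes$category[i] == "drugs") {
    keep <- c(keep, i)
  }
}
drugs_loop <- crimes[keep, wanted]

# Vectorised version: which() drops the NA comparison result
drugs_vec <- crimes[which(crimes$category == "drugs"), wanted]
```

Both versions return the same two rows and three columns here.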
I am trying to subset two columns ("nitrate" and "sulfate") from many files that have a typical number of rows and columns. Here is my code:
pollutant <- if(pollutant == TRUE){
id[,"nitrate"]
} else {
id[,"sulfate"]
}
I need these columns so that I can compute their means.
Please give me a hand; I am a newcomer to R.
The if statement accepts only a single logical value at a time. If pollutant is a data.frame or a similar structure, the if statement is going to fail. My suggestion is to try the data.table package. It makes life much easier for what it seems you want to do (I don't completely get it from your text).
library(data.table)
pollutant <- data.table(pollutant)
Pollutat.Subset <- pollutant[id == "nitrate" | id =="sulfate",]
This should subset your data to the rows whose id column is either "nitrate" or "sulfate".
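If the data instead has one column per pollutant, as the question's id[,"nitrate"] suggests, a base R sketch could select the column by name and take its mean. The data frame below is made up, and pollutant_mean is a hypothetical helper name, not a function from any package.

```r
# Made-up data with one column per pollutant, as in the question's id object
id <- data.frame(
  nitrate = c(1.2, NA, 0.8),
  sulfate = c(3.1, 2.9, NA)
)

# Hypothetical helper: mean of one named column, ignoring missing values
pollutant_mean <- function(data, pollutant) {
  stopifnot(length(pollutant) == 1, pollutant %in% names(data))
  mean(data[[pollutant]], na.rm = TRUE)
}

# if works here because the condition is a single TRUE/FALSE value
use_nitrate <- TRUE
column <- if (use_nitrate) "nitrate" else "sulfate"

nitrate_mean <- pollutant_mean(id, column)
sulfate_mean <- pollutant_mean(id, "sulfate")
```

Passing the column name as a string avoids reusing the name pollutant for both the flag and the result, which is what makes the original snippet hard to follow.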
I have a data frame that contains data for different observations, where the observations are grouped by a unique code. As a reproducible example, here is what the simulated data look like:
v <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2,2,3,3,4,4,4,4,4,5,5,5,5,5,5,5,5,6,6,6)
mat1 <- matrix(runif(200),40)
mat1 <- cbind(v,mat1)
mat1 <-as.data.frame(mat1)
names(mat1) <- c('code','x1','x2','x3','x4','x5')
unq <- unique(mat1$code)
What I would like to do is to calculate an average for each observation, based on two previous and two future observations (you can think about this as a time series). So for example
mat1$X1[3] = mean(mat1$x1[1:5])
mat1$X1[4] = mean(mat1$x1[2:6])
and so on. I am able to do the calculation using a particular code (for example when mat1$code==1):
K <- data.frame(code=mat1$code,x1=rep(0,40),x2=rep(0,40),x3=rep(0,40),x4=rep(0,40),x5=rep(0,40))
for ( i in 3:(nrow(mat1)-2)){
if(mat1$code[i]==unq[1]){
K[i,2] <- mean(mat1[(i-2):(i+2),2]) # parentheses are required: i-2:i+2 parses as i-(2:i)+2
}
}
However, there are two things that I couldn't figure out:
(1) Since the actual dataset is much larger than the simulated one, how can I dynamically go through all the unique codes and do the calculation, noting that the first and last two observations of each unique code should be zero (and I will eventually get rid of them).
(2) The number of observations for each unique code is different, and some codes have fewer than 4 observations, in which case no calculation can be done for that code!
Any help is highly appreciated.
Thank you
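One possible way to handle both points is a centered 5-point moving average computed separately within each code. The sketch below uses base R only (stats::filter for the moving average, ave for the grouping). roll5 is a hypothetical helper name, and the edge positions are returned as NA rather than zero, which is a deliberate change from the question.

```r
set.seed(1)
v <- c(rep(1, 10), rep(2, 12), rep(3, 2), rep(4, 5), rep(5, 8), rep(6, 3))
mat1 <- data.frame(code = v, x1 = runif(40))

# Hypothetical helper: centered 5-point mean; NA where the window does not fit.
# stats::filter() errors when the filter is longer than the series, so groups
# with fewer than 5 observations are handled explicitly.
roll5 <- function(x) {
  if (length(x) < 5) return(rep(NA_real_, length(x)))
  as.numeric(stats::filter(x, rep(1 / 5, 5), sides = 2))
}

# ave() applies roll5 within each code and returns a vector aligned with mat1
mat1$x1_smooth <- ave(mat1$x1, mat1$code, FUN = roll5)
```

The same call can be repeated for x2 to x5, for example with lapply over the column names, and rows where the result is NA can be dropped afterwards.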
I am dipping my toes into R and I am now able to sort columns by their means, but I would now like to sort the columns by the range of the data points in each column, largest first.
Say I have a table with point ratings for movies. How can I get the top 10 movies where opinions differ the most? Is there a function that can measure this? One idea of mine was to use the minimum and maximum values, but then just one outlier can mess it all up.
I guess the size of the box of a boxplot could be a pretty good measure.
Any ideas?
Update:
So I tried sorting my table by the columns' sd(), but I can't quite get that to work. This is what I tried. The table has column headings, by the way, and is named newdata here.
> newdata.sd <- sapply(1:107, function(j) sd(newdata[,j], na.rm=TRUE))
> newdata.sorted.sd <- newdata[,names(sort(newdata.sd, decreasing=TRUE))]
The second line throws an error because the first doesn't produce a Named num vector.
When I did the same thing with sorting by the columns means it worked. That I did with the following two lines.
> newdata.mean <- colMeans(newdata, na.rm=TRUE)
> newdata.sorted <- newdata[,names(sort(newdata.mean, decreasing=TRUE))]
How can I produce a named vector of sd()s like in the second example?
A different way to sort by sd() would be also appreciated.
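One way to get a named vector is to let sapply iterate over the data frame itself instead of over column indices, since sapply keeps the column names of a data frame. The sketch below uses a small made-up newdata, and also shows IQR(), which corresponds to the height of the box in a boxplot and is less sensitive to a single outlier than sd().

```r
# Made-up ratings table: column b contains one outlier
newdata <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(1, 1, 1, 1, 10),
  c = c(2, 2, 3, 3, 3)
)

# Iterating over the data frame keeps the column names on the result
newdata.sd <- sapply(newdata, sd, na.rm = TRUE)
newdata.sorted.sd <- newdata[, names(sort(newdata.sd, decreasing = TRUE))]

# Same idea with the interquartile range (the boxplot "box size")
newdata.iqr <- sapply(newdata, IQR, na.rm = TRUE)
newdata.sorted.iqr <- newdata[, names(sort(newdata.iqr, decreasing = TRUE))]
```

In this toy data the outlier puts column b first when sorting by sd() but last when sorting by IQR().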
I'm learning to work with lists in R. I looked on the internet and in some books, but I did not find a solution.
I have a data frame with n rows and several columns. What I would like is a simple and quick way to plot one column (e.g. value1) for each year (another column).
First, I created a list from a data.frame using split:
lst<-split(X, X$Year)
So now I have the data frame split by year, and that's fine.
But now, how can I create a plot of value1 for each year?
I tried to write a short script, but it doesn't work at all:
lst<-split(X, X$Year)
for (i in names(lst)) {
plot(i$value1)
}
Is this what you are after?
library(xts)
data<-xts(data.frame(a=c(1,2,3), b=c(4,5,6)), c(as.POSIXct("1970-01-01"), as.POSIXct("1971-01-01"), as.POSIXct("1972-01-01")))
plot(data[,1], ylim=c(min(data), max(data)))
for (i in 2:ncol(data)) {
lines(data[,i])
}
Crude but works...
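Staying with the split() approach from the question, the loop fails because i is only the name of a list element (a character string), so i$value1 does not reach the data. A minimal sketch of the fix, on a made-up X with the Year and value1 columns the question mentions, is to index the list with lst[[i]]:

```r
# Made-up data with the two columns named in the question
X <- data.frame(
  Year   = c(2001, 2001, 2002, 2002, 2002),
  value1 = c(1, 3, 2, 5, 4)
)

lst <- split(X, X$Year)

# i is a name such as "2001"; lst[[i]] is the data frame for that year
for (i in names(lst)) {
  plot(lst[[i]]$value1, main = i, ylab = "value1")
}
```

This draws one plot per year, titled with the year; par(mfrow = ...) can be set beforehand if the plots should share one page.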