I am trying to subset one of two columns ("nitrate" and "sulfate") from many files that all have the usual layout of rows and columns. Here is my code:
pollutant <- if(pollutant == TRUE){
id[,"nitrate"]
} else {
id[,"sulfate"]
}
I then need to compute the mean of these columns.
Please give me a hand, I am a newcomer to R.
The if statement only accepts a single value at a time. If pollutant is a data.frame or a similar structure, the if statement is going to fail. My suggestion is to try data.table. It makes life much easier for what you seem to want to do (I don't completely get it from your text).
library(data.table)
pollutant <- data.table(pollutant)
Pollutant.Subset <- pollutant[id == "nitrate" | id == "sulfate", ]
This should subset your data based on the identity of ID.
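If the goal is simply to pick one of the two columns by name and take its mean, base R can do that without any if statement. A minimal sketch, assuming a data frame with columns named nitrate and sulfate; the toy data and the helper name pollutant_mean are made up for illustration:

```r
# Toy data standing in for one of the files (values are made up)
monitor <- data.frame(
  nitrate = c(1.0, NA, 3.0),
  sulfate = c(2.0, 4.0, NA)
)

# Hypothetical helper: select a column by its name and average it,
# ignoring missing values
pollutant_mean <- function(df, pollutant) {
  stopifnot(pollutant %in% names(df))
  mean(df[[pollutant]], na.rm = TRUE)
}

nitrate_mean <- pollutant_mean(monitor, "nitrate")
sulfate_mean <- pollutant_mean(monitor, "sulfate")
```

Passing the column name as a string means the same function works for either pollutant, so no TRUE/FALSE switch is needed.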
For a university course where we learn R, we have to filter a supplied dataframe (called crimes). The original dataframe has 8 columns.
I do not think I can supply the data set, since it is part of an assignment for school. But any advice would be really appreciated.
The task requires using a loop and an if-statement to filter one column ("category") and keep only the rows with one specific level out of 14 (named "drugs"), then to put only three of the eight columns of those rows into a new dataframe.
for (i in crimes$category) {
if (i == "drugs") {
drugs <- rbind(drugs, crimes[c(2,3,7)])
}
}
Now I know the problem is in the rbind function, since it just duplicates all rows 160 times (there are 160 rows with the category "drugs"). But I do not know how to get a dataframe with 160 observations and only 3 variables.
Note that the assignment defeats the purpose of using R. But that said, use the for / if construct to get the row numbers containing the category value "drugs" then create the result df outside the loop:
keep <- integer()
# Loop over row numbers (not over the category values themselves)
for (i in seq_along(crimes$category)) {
  if (crimes$category[i] == "drugs") {
    keep <- c(keep, i)
  }
}
crimes2 <- crimes[keep, c(2,3,7)]
Note the base R no-loop solution would be:
crimes2 <- crimes[crimes$category == "drugs", c(2,3,7)]
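To check that the loop and the vectorised form agree, here is a small sketch on made-up data. The toy crimes data frame below has only three columns and invented values, so the column positions differ from the assignment's c(2,3,7):

```r
# Toy stand-in for the crimes data (the real one has 8 columns)
crimes <- data.frame(
  id = 1:5,
  category = c("drugs", "theft", "drugs", "arson", "theft"),
  street = c("A", "B", "C", "D", "E"),
  stringsAsFactors = FALSE
)

# Loop version: collect the row numbers where category is "drugs"
keep <- integer()
for (i in seq_len(nrow(crimes))) {
  if (crimes$category[i] == "drugs") {
    keep <- c(keep, i)
  }
}
loop_result <- crimes[keep, c(1, 3)]

# Vectorised version: the comparison yields one TRUE/FALSE per row
vec_result <- crimes[crimes$category == "drugs", c(1, 3)]

# Both approaches should select the same rows and columns
same <- isTRUE(all.equal(loop_result, vec_result))
```

The key point is that keep holds row numbers, so the data frame is subset once, outside the loop, instead of being rebuilt with rbind on every iteration.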
I am having trouble creating a subset for a large dataframe. I need to extract all rows that match one of two correct cities in one of the columns, however any subset that I create ends up empty. Given the main dataframe, I try:
New = data[data$Home.port %in% c("ARDGLASS","NEWLYN")]
However R returns "undefined columns selected"
A comma is missing:
New = data[data$Home.port %in% c("ARDGLASS","NEWLYN"), ]
That is because you are selecting rows, not columns; if you leave out the comma, R tries to subset columns instead of rows.
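The difference can be reproduced on a tiny example. This is only a sketch with an invented boats data frame; the column Home.port is taken from the question, everything else is made up:

```r
# Toy data: 3 rows, 2 columns
boats <- data.frame(
  Name = c("a", "b", "c"),
  Home.port = c("ARDGLASS", "OBAN", "NEWLYN"),
  stringsAsFactors = FALSE
)
rows <- boats$Home.port %in% c("ARDGLASS", "NEWLYN")

# With the comma: the logical vector selects rows, all columns are kept
with_comma <- boats[rows, ]

# Without the comma: the same vector is treated as a column selector.
# It has 3 elements but there are only 2 columns, hence the error.
no_comma <- tryCatch(boats[rows], error = function(e) conditionMessage(e))
```

Here with_comma has the two matching rows, while no_comma captures the error message instead of a data frame.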
I recommend using data.table, like so:
# install.packages("data.table")
library(data.table)
data <- as.data.table(data)
new_data <- data[Home.port %in% c("ARDGLASS","NEWLYN")]
See the data.table documentation to learn more; data.table is very fast with large datasets.
The subset function will also do this task
new <- subset(data, subset = Home.port %in% c("ARDGLASS","NEWLYN"))
The base approach is functionally the same; it's just a matter of whether or not you use a declarative function for the task.
When using subset(), the first argument is the data frame you want to subset. When you refer to its variables in the condition, you do not need to put "data$" in front. This saves time and makes the code easier to read.
datasubset <- subset(data, Home.port %in% c("ARDGLASS","NEWLYN"))
You can also combine multiple conditions: use "&" for AND and "|" for OR, depending on what you plan to do. To match either port the conditions must be joined with "|", since a single row cannot have both values at once.
datasubset <- subset(data, Home.port == "ARDGLASS" | Home.port == "NEWLYN")
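A short sketch on made-up data shows how "&" and "|" behave. The boats data frame and its Length column are invented for illustration:

```r
# Toy data (values are made up)
boats <- data.frame(
  Home.port = c("ARDGLASS", "OBAN", "NEWLYN"),
  Length = c(12, 9, 15),
  stringsAsFactors = FALSE
)

# OR: rows whose port is either value
either_port <- subset(boats, Home.port == "ARDGLASS" | Home.port == "NEWLYN")

# AND on the same column: no row can equal both values, so nothing matches
both_ports <- subset(boats, Home.port == "ARDGLASS" & Home.port == "NEWLYN")

# AND across different columns is where "&" is useful
long_ardglass <- subset(boats, Home.port == "ARDGLASS" & Length > 10)
```

For a list of allowed values in one column, %in% is usually the shortest way to write the OR case.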
I'm trying to see if R has a command similar to Stata. In Stata, the !mi(a, b, c,...) command creates a new variable and indicates a 1/0 if the indicated variable(s) have no missing data. 1 = no missing data across variables x, 0 = missing data in one of the variables x.
I'm looking for simple code because sometimes I have about 15-20 variables (mainly to mark listwise deletion cases). It takes a little more work, but I specify the column names instead of using the : marker. The options I've found create a new dataframe (na.omit), but I want to retain all the cases.
I know that ifelse can accomplish this using:
df$test <- ifelse(!is.na(df$ID) & !is.na(df$STATUS), 1,0)
I'd like to know if there's another way with less code where I don't need to write "!is.na(df$ )" over and over. Maybe a $global code (similar to Stata)?
You should be able to do this using complete.cases
df$test <- as.numeric(complete.cases(df))
You could also use rowSums:
df$test <- as.numeric(rowSums(is.na(df)) == 0)
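Both calls above check every column of df. To restrict the check to a named set of variables, as in the question, the columns can be selected first. A minimal sketch with made-up data; the OTHER column and the vars vector are invented for illustration:

```r
# Toy data (values are made up); OTHER is deliberately excluded from the check
df <- data.frame(
  ID = c(1, 2, NA, 4),
  STATUS = c("a", NA, "c", "d"),
  OTHER = c(NA, 1, 2, 3),
  stringsAsFactors = FALSE
)

# Names of the variables that must be non-missing
vars <- c("ID", "STATUS")

# 1 if none of the listed variables is missing in that row, 0 otherwise
df$test <- as.numeric(complete.cases(df[, vars]))
```

Row 1 is flagged 1 even though OTHER is missing there, because only ID and STATUS are checked. All rows are kept; only the indicator column is added.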
I'm new to programming in R and I'm working with a huge dataset containing hundreds of variables and thousands of observations. Among these variables is Age, which is my main concern. I want to get the mean of each of the other variables as a function of Age. I can get smaller tables with this:
for(i in 18:84)
{
n<- sprintf("SortAgeM%d",i)
assign(x=n,subset(SortAgeM,subset=(SortAgeM$AGE>=i & SortAgeM$AGE<i+1)))
}
"SortAgeM85plus"<-subset(SortAgeM,subset=(SortAgeM$AGE>=85 & SortAgeM$AGE<100))
This gives me a sub-dataset for each age I'm concerned with. I would then like to get the mean of each column. Each column is an observation of the volume of a specific brain region. I'm interested in how the volume decreases with age, and I would like to know whether individuals of a given age are close to the mean for their age or not.
Now, I would like to get one more row with the mean for each column. So I tried this:
for(i in 18:85) {
addmargins((SortAgeM%d,i), margin=1, FUN= "mean")
}
But it didn't work... I'm stuck and I'm not familiar enough with R function to find a solution on the net...
Thank you for your help.
Victor
Post answer edit: This is what I finally did:
for (i in 18:84)
{
  n <- sprintf("SortAgeM%d", i)
  # subset for this age, kept in a local variable before assigning it by name
  item <- subset(SortAgeM, subset = (SortAgeM$AGE >= i & SortAgeM$AGE < i + 1))
  Ajustment <- c(NA, NA, NA, NA, NA, NA, NA) # first variables aren't numeric
  Ligne1 <- colMeans(item[, 8:217], na.rm = TRUE)
  Ligne <- c(Ajustment, Ligne1)
  assign(x = n, rbind(item, Ligne))
}
If you simply want an additional row with the mean of each column, you can rbind the colMeans of your df like this (this assumes every column of df is numeric):
df_new <- rbind(df, colMeans(df))
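As an alternative to one data frame per age, the per-age means can be computed in a single table with base R's aggregate(), and each individual's distance from the mean of their age group with ave(). This is only a sketch on made-up data: the data frame brain, the columns vol_a and vol_b, and grouping by floor(AGE) are assumptions, not taken from the original dataset.

```r
# Toy data: two volume columns and a fractional age (values are made up)
brain <- data.frame(
  AGE = c(18.2, 18.7, 19.1, 19.9, 20.5),
  vol_a = c(10, 12, 11, 13, 9),
  vol_b = c(5, 7, 6, 8, 4)
)

# Group by whole years of age, mirroring AGE >= i & AGE < i + 1
brain$age_group <- floor(brain$AGE)

# One row per age group, one mean per volume column
age_means <- aggregate(cbind(vol_a, vol_b) ~ age_group, data = brain, FUN = mean)

# Deviation of each individual from the mean of their own age group
brain$vol_a_dev <- brain$vol_a - ave(brain$vol_a, brain$age_group)
```

This keeps everything in one table, which makes it easier to plot the means against age than a set of separately named data frames.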
I wrote the following code to extract multiple datasets out of one large dataset based on the column Time.
for(i in 1:nrow(position)) {
assign(paste("position.",i,sep=""), subset(dataset, Time >= position[i,1] & Time <= position[i,2])
)
}
(position is a list which contains the starttime[,1] and stoptime[,2])
The outputs are subsets of my original dataset and look like:
position.1
position.2
position.3
....
Is it possible to add an extra column to each of the new datasets (position.1, position.2, ...) that identifies them by a number?
e.g. position.1 has an extra column with value 1, position.2 has an extra column with value 2, and so on.
I need those numbers to identify the datasets (position.1, position.2, ...) after I rbind them into one dataset again in a last step.
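Since you don't provide example data, this is untested, but should work for you:
dflist <-
  lapply(seq_len(nrow(position)), function(x) {
    # rows of dataset that fall inside the x-th time window
    sub <- dataset[dataset$Time >= position[x,1] & dataset$Time <= position[x,2],]
    sub$val <- rep(x, nrow(sub))  # identifier column; also safe for empty subsets
    sub
  })
do.call(rbind, dflist)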
Basically, you never want to take the strategy you propose of assigning multiple numbered objects to the global environment. It is much easier to store all of the subsets in a list and then bind them back together using do.call(rbind, dflist). This is more efficient, produces less clutter in your workspace, and is a more "functional" style of programming.
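If the individual subsets are still needed by name, the list elements can be named instead of creating separate objects. A small sketch on made-up data; the toy dataset and position values are invented, with position holding start times in column 1 and stop times in column 2 as described in the question:

```r
# Toy data (values are made up)
dataset <- data.frame(Time = 1:10, value = (1:10)^2)
position <- data.frame(start = c(1, 6), stop = c(3, 7))

# One subset per time window, each tagged with the window number
subsets <- lapply(seq_len(nrow(position)), function(i) {
  sub <- dataset[dataset$Time >= position[i, 1] & dataset$Time <= position[i, 2], ]
  sub$val <- rep(i, nrow(sub))
  sub
})

# Name the list elements so they can be looked up like the old objects
names(subsets) <- paste0("position.", seq_along(subsets))
first_window <- subsets[["position.1"]]

# Bind everything back into one data frame; val identifies the source window
combined <- do.call(rbind, subsets)
```

The val column survives the rbind, so each row of combined can be traced back to its time window.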
In addition to Thomas's recommendation to avoid side effects, you might want to take advantage of existing packages that detect overlaps. The IRanges package in Bioconductor can detect overlaps between one set of ranges (position) and another set of ranges or positions (dataset$Time). This gets you the matches between the time points and the ranges:
library(IRanges)  # Bioconductor package
r <- IRanges(position[[1L]], position[[2L]])
hits <- findOverlaps(dataset$Time, r)
Now, you want to extract a subset of the dataset that overlaps each range in position. We can group the query (Time) indices by the subject (position) indices and extract a list from the dataset using that grouping:
dataset <- DataFrame(dataset)
l <- extractList(dataset, split(queryHits(hits), subjectHits(hits)))
To get the final answer, we need to combine the list elements row-wise, while adding a column that denotes their group membership:
ans <- stack(l)