Remove rows from an R Data frame - r

I have a data set that has a number of columns, but to keep it short here's an abbreviated form (the data is from the Divvy competition):
Trip ID  Tripduration  from_id  to_id
1        50            2        2
2        700           2        5
3        80            2        4
When I imported the data from the .csv, R made it into a data.frame, which is OK. So using
full.set2 <- sapply(full.set, function(x)
  if (is.factor(x)) {
    as.numeric(x)
  } else {
    x
  })
I was able to convert the entire thing into a "Large Matrix" (according to RStudio). So now I'm trying to clear out the values that meet 2 criteria:
1) Tripduration <= 90
&&
2) from_id == to_id
When I do
full.set2t<-full.set2[full.set2[,2]>=90]
It makes full.set2t into one very large vector rather than keeping it as a matrix (though it does look like it might be removing the proper values, as the number of elements decreased).
I've also tried subset on the original data.frame but I got the error that "> not meaningful for factors"
Any ideas? I've searched around and can't seem to get any of the other solutions I've found to work.
EDIT: As I'm continuing searching I'll put here other things I've tried that didn't work:
x<-seq(1:90)
x<-as.numeric(x)
y<- full.set[! full.set$tripduration %in% x,]
## Does something, removes some data points but not all of the proper ones
Solution found!
full.set$tripduration <- as.numeric(full.set$tripduration)
full.set.test <- full.set[full.set$tripduration > 90, ]
Turns out that the column was a factor and not numeric, and I didn't know how to convert that single column.

The problem is this line
full.set2t<-full.set2[full.set2[,2]>=90]
To subset a data.frame you need to use [rows, columns], where leaving one blank means select everything. So the line should be
full.set2t<-full.set2[full.set2[,2]>=90,] # note the comma
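For completeness, here is a minimal hedged sketch that combines the factor conversion and both filter criteria from the question in one step (column names are the ones used in the question's own code). Note that as.numeric() on a factor returns the underlying level codes, so converting through as.character() first is the safer route:
# convert the factor column to its actual numeric values
full.set$tripduration <- as.numeric(as.character(full.set$tripduration))
# drop rows where BOTH removal criteria hold: duration <= 90 and same start/end station
full.set.clean <- full.set[!(full.set$tripduration <= 90 & full.set$from_id == full.set$to_id), ]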

Related

rowSums isn't adding correctly?

I have a presence/absence database with a bunch of zeroes and ones, but when I use rowSums, it seems to only count a portion of the data and then stop. Here's my code:
site_matrix = read.csv("TriassicMatrix1.csv", header=T) # create object called site_matrix
summary(site_matrix) # get summary
head(site_matrix) # check out first few rows
tail(site_matrix) # check out last few rows
View(site_matrix) # take a look at whole dataset in new window
# here's the problematic line
spp_rich = rowSums(site_matrix[, 2:25]) # generate richness for sites
There are 25 rows of data, and it gives me incorrect output, such as suggesting the first row only has 4 occurrences when it has 7.
I tried changing it to [, 1:25] and it won't work since row 1 is my title row, so I know it's not that. When I view the data within R I can very easily go to row 2 and count out the data, since there are only a few hundred columns.
It appears to be 'cutting off' at about the halfway point, column-wise.
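There is no accepted answer here, but a plausible fix, assuming the first column holds site names and every remaining column is a 0/1 presence value: the problematic line only sums columns 2 through 25, so with a few hundred columns most of the matrix is ignored. Dropping just the first column instead sums everything:
# sum across ALL presence/absence columns, not just columns 2:25
spp_rich <- rowSums(site_matrix[, -1])
# or, if some columns are non-numeric, restrict to the numeric ones first
spp_rich <- rowSums(site_matrix[, sapply(site_matrix, is.numeric)])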

R: Dropping variables using number of observations

I have a large dataset, and I'm trying to drop some of my variables based on how many observations each has. For instance, I would like to drop any variable in my dataframe where n < 3 (total observations for that variable is less than 3). Since R can count observations for each variable using describe(), can't I use that count to subset the data instead of having to type in each variable name every time I pull in a new version? (Each version has different variables with low n's, and there are over 40 variables.) Thanks so much for your help!
For instance, my data looks like this:
ID Runaway Aggressive Emergency Hospitalization Injury
1  3       NA         4         1               NA
2  NA      NA         2         1               NA
3  4       NA         6         2               3
4  1       NA         1         1               NA
I want to be able to drop "Aggressive" and "Injury" based on their n's being 0 and 1 respectively. However, instead of telling R to drop them by variable name, it would be much more convenient if it was possible to tell R to drop any variable where n < 3 (or whatever number I choose) as I'll be using this code for multiple versions of this dataset. I have tried using column numbers (which is better than writing them out) but it's still pretty tedious when I have to describe() the data, figure out which variables have low n's, and then drop 28 variables or subset() around them.
This works but it's cumbersome...
UIRCorrelation <- UIRKidUnique61[c(28, 30, 32, 34:38, 42, 54:74)]
For some reason, my example looks different when I'm editing versus when I save, so I also included an image of it. Sorry, this is the first time I've ever used Stack Overflow to ask a question. I actually spent a lot of time googling this but couldn't find an answer relating to n.
One suggested one-liner (DF being your dataframe) did not work for me:
DF[, sapply(DF, function(col) length(na.omit(col))) > 4]
This function did the trick:
valid <- function(x) {sum(!is.na(x))}
N <- apply(UIRCorrelation,2,valid)
UIRCorrelation2 <- UIRCorrelation[N > 3]
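As a quick sanity check, here is a minimal sketch of the same idea applied to the example data above (values taken from the table; the cutoff of 3 is the one from the question). It keeps ID, Runaway, Emergency, and Hospitalization and drops Aggressive and Injury:
# rebuild the example data frame
df <- data.frame(ID = 1:4,
                 Runaway = c(3, NA, 4, 1),
                 Aggressive = c(NA, NA, NA, NA),
                 Emergency = c(4, 2, 6, 1),
                 Hospitalization = c(1, 1, 2, 1),
                 Injury = c(NA, NA, 3, NA))
# count non-NA observations per column and keep columns with n >= 3
n_obs <- sapply(df, function(x) sum(!is.na(x)))
df_kept <- df[, n_obs >= 3]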

Counting NA values by ID?

I'm learning R from scratch right now and am trying to count the number of NA's within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparably long (10k+ lines), so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, counting the rows with missing values in each
for(filePos in 1:length(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(paste0(<filePath>, fileNames[filePos]), as.is=TRUE)
  # count the number of rows with missing values,
  # ** fill in <fieldName#> with strings of variable names **
  # MARGIN=1 applies anyNA to each row of the selected columns
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)],
                                    1, function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
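The question itself asks for NA counts aggregated by an ID column within a single table rather than per file. Here is a minimal hedged sketch of that in base R, assuming a data frame called Data whose first column is ID, with the remaining columns holding the values to check, and assuming the goal is to count rows containing at least one NA:
# flag rows that contain at least one NA outside the ID column
na_row <- apply(is.na(Data[, -1]), 1, any)
# total the flagged rows per ID
na_by_id <- aggregate(na_row, by = list(ID = Data$ID), FUN = sum)
names(na_by_id)[2] <- "NA_Count"
The result is a two-column data frame matching the [ID] [NA_Count] layout shown above.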

Filling Gaps in Time Series Data in R

So this question has been bugging me for a while since I've been looking for an efficient way of doing it. Basically, I have a dataframe, with a data sample from an experiment in each row. I guess this should be looked at more as a log file from an experiment than the final version of the data for analyses.
The problem that I have is that, from time to time, certain events get logged in a column of the data. To make the analyses tractable, what I'd like to do is "fill in the gaps" for the empty cells between events so that each row in the data can be tied to the most recent event that has occurred. This is a bit difficult to explain but here's an example:
Now, I'd like to take that and turn it into this:
Doing so will enable me to split the data up by the current event. In any other language I would jump into using a for loop to do this, but I know that R isn't great with loops of that type, and, in this case, I have hundreds of thousands of rows of data to sort through, so I'm wondering if anyone can offer suggestions for a speedy way of doing this?
Many thanks.
This question has been asked in various forms on this site many times. The standard answer is to use zoo::na.locf. Search [r] for na.locf to find examples of how to use it.
Here is an alternative way in base R using rle:
d <- data.frame(LOG_MESSAGE=c('FIRST_EVENT', '', 'SECOND_EVENT', '', ''))
within(d, {
  # ensure character data
  LOG_MESSAGE <- as.character(LOG_MESSAGE)
  CURRENT_EVENT <- with(rle(LOG_MESSAGE),  # list with 'values' and 'lengths'
                        rep(replace(values,
                                    nchar(values) == 0,
                                    values[nchar(values) != 0]),
                            lengths))
})
#    LOG_MESSAGE CURRENT_EVENT
# 1  FIRST_EVENT   FIRST_EVENT
# 2                FIRST_EVENT
# 3 SECOND_EVENT  SECOND_EVENT
# 4               SECOND_EVENT
# 5               SECOND_EVENT
The na.locf() function in package zoo is useful here, e.g.
require(zoo)
dat <- data.frame(ID = 1:5, sample_value = c(34, 56, 78, 98, 234),
                  log_message = c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA))
dat <- transform(dat,
                 Current_Event = sapply(strsplit(as.character(na.locf(log_message)),
                                                 "_"),
                                        `[`, 1))
Gives
> dat
  ID sample_value  log_message Current_Event
1  1           34  FIRST_EVENT         FIRST
2  2           56         <NA>         FIRST
3  3           78 SECOND_EVENT        SECOND
4  4           98         <NA>        SECOND
5  5          234         <NA>        SECOND
To explain the code:
1. na.locf(log_message) returns a factor (that was how the data were created in dat) with the NAs replaced by the previous non-NA value (the "last observation carried forward" part).
2. The result of 1. is then converted to a character string.
3. strsplit() is run on this character vector, breaking it apart on the underscore. strsplit() returns a list with as many elements as there were elements in the character vector. In this case each component is a vector of length two. We want the first elements of these vectors,
4. so I use sapply() to run the subsetting function `[`() and extract the 1st element from each list component.
5. The whole thing is wrapped in transform() so i) I don't need to refer to dat$ and ii) I can add the result as a new variable directly into the data dat.
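A tiny illustration of step 1, on a throwaway character vector rather than the dat column, shows what na.locf() does on its own:
library(zoo)
msgs <- c("FIRST_EVENT", NA, "SECOND_EVENT", NA, NA)
na.locf(msgs)
# "FIRST_EVENT" "FIRST_EVENT" "SECOND_EVENT" "SECOND_EVENT" "SECOND_EVENT"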

Merging databases in R on multiple conditions with missing values (NAs) spread throughout

I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below will go through all the variables and pull the most complete information available back into the dataset. Once all the data is there, running it one last time on the first variable gives the result.
# combine all into one dataframe
require(gtools)
d <- smartbind(d1, d2, d3)

# function to get the first non-NA result
getfirstnonna <- function(x){
  ret <- head(x[which(!is.na(x))], 1)
  ret <- ifelse(is.null(ret), NA, ret)
  return(ret)
}

# function to get max info based on one variable
runiteration <- function(dataset, variable){
  require(plyr)
  e <- ddply(.data=dataset, .variables=variable,
             .fun=function(x){apply(X=x, MARGIN=2, FUN=getfirstnonna)})
  # returns the above without the NA "factor"
  return(e[which(!is.na(e[, variable])), ])
}

# run through all variables
for(i in 1:length(names(d))){
  d <- rbind(d, runiteration(d, names(d)[i]))
}
# repeat first variable since all possible info should be available in dataset
d <- runiteration(d, names(d)[1])
If id, species, etc. differ between the separate datasets, then this will return whichever non-NA value sits on top, so changing the row order in d, or the variable order, could affect the result. Changing the getfirstnonna function will alter this behavior (tail would pick the last value, and you could even collect all possibilities). You could also order the dataset from the most complete records to the least.
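A quick illustration of that behavior on a toy vector (not the real data): getfirstnonna() simply picks whichever non-NA value appears first, so row order decides any conflicts.
getfirstnonna(c(NA, "A.a", "B.b"))
# returns "A.a"; reversing the order of the two values would return "B.b" instead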
