I'm learning R from scratch right now and am trying to count the number of NAs within a given table, aggregated by the ID of the file it came from. I then want to output that information in a new data frame, showing just the ID and the sum of the NA lines contained within. I've looked at some similar questions, but they all seem to deal with very short datasets, whereas mine is comparatively long (10k+ lines), so I can't call out each individual line to aggregate.
Ideally, if I start with a data table called "Data" with a total of four columns, and one column called "ID", I would like to output a data frame that is simply:
[ID] [NA_Count]
1 500
2 352
3 100
Thanks in advance...
Something like the following should work, although I am assuming that Date is always there and Field 1 and Field 2 are numeric:
# get file names and initialize a vector for the counts
fileNames <- list.files(<filePath>)
missRowsVec <- integer(length(fileNames))
# loop through files, get number of rows with missing values in each
for(filePos in seq_along(fileNames)) {
  # read in files **fill in <filePath>**
  temp <- read.csv(file.path(<filePath>, fileNames[filePos]), as.is=TRUE)
  # count the number of rows with missing values,
  # ** fill in <fieldName#> with strings of variable names **
  # MARGIN = 1 applies the function to each row
  missRowsVec[filePos] <- sum(apply(temp[, c(<field1Name>, <field2Name>)],
                                    1, function(i) anyNA(i)))
} # end loop
# build data frame
myDataFrame <- data.frame("fileNames"=fileNames, "missCount"=missRowsVec)
This may be a bit dense, but it should work more or less. Try small portions of it, like just some inner function, to see how stuff works.
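If everything is already in one data frame with an ID column, as the question describes, the loop over files may not be needed. A minimal sketch, assuming a small made-up `Data` (the columns `a`, `b`, `c` and the helper name `has_na` are hypothetical) and assuming that "NA lines" means rows with at least one NA:

```r
# Toy stand-in for the real table: an ID column plus three data columns
Data <- data.frame(ID = c(1, 1, 2, 2, 3),
                   a  = c(NA, 1, 2, 4, 5),
                   b  = c(NA, 2, NA, 4, 5),
                   c  = c(1, NA, 3, 4, 5))

# TRUE for every row that has at least one NA outside the ID column
has_na <- rowSums(is.na(Data[, names(Data) != "ID", drop = FALSE])) > 0

# Sum the flags within each ID and give the count column a readable name
NA_Count <- aggregate(has_na, by = list(ID = Data$ID), FUN = sum)
names(NA_Count)[2] <- "NA_Count"
```

To count individual NA cells rather than rows, dropping the `> 0` comparison should be enough.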
I have a presence-absence database with a bunch of zeroes and ones, but when I use rowSums, it seems to only count a portion of the data and then stop. Here's my code:
site_matrix=read.csv("TriassicMatrix1.csv", header=T) # create object called site_matrix
summary(site_matrix) # get summary
head(site_matrix) # check out first few rows
tail(site_matrix) # check out last few rows
View(site_matrix) # take a look at whole dataset in new window
# here's the problematic line
spp_rich=rowSums(site_matrix [,2:25]) # generate richness for sites
There are 25 rows of data, and it gives me incorrect output, such as suggesting the first row only has 4 occurrences when it has 7.
I tried changing it to [,1,25] and it won't work since row 1 is my title row, so I know it's not that. When I view the data within R I can very easily go to row 2 and count out the data, since there are only a few hundred columns.
It appears to be 'cutting off' at about the halfway point, column-wise.
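One possible reading of the symptoms: `[,2:25]` selects columns 2 through 25 only, so with a few hundred columns most of the data never reaches rowSums. A minimal sketch, assuming the first column holds the site labels and every other column is 0/1 (the toy column names are made up):

```r
# Toy presence-absence table: one label column, three species columns
site_matrix <- data.frame(site = c("s1", "s2"),
                          sp1  = c(1, 0),
                          sp2  = c(1, 1),
                          sp3  = c(1, 0))

# Drop only the label column, then sum across every remaining column
spp_rich <- rowSums(site_matrix[, -1, drop = FALSE])
```

Indexing with `-1` avoids hard-coding the last column number, so the same line keeps working if species columns are added.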
As below, dataframe factorizedss is the factorized version of a sourcedata dataframe ss.
ss <- data.frame(c('a','b','a'), c(1,2,1)); #There are string columns and number columns.
#So, I factorized them as below.
factorizedss <- data.frame(lapply(ss, as.factor)); #factorized version
indices <- data.frame(c(1,1,2,2), c(1,1,1,2)); #Now, given integer indices
With the given indices, using factorizedss, is it possible to get the corresponding elements of the source dataframe as below? (The purpose is to access a data frame element by its integer position among the factor levels.)
a 1
a 1
b 1
b 2
You can access the first column like this
factorizedss[indices[,1],][,1]
and the second in a similar way
factorizedss[indices[,2],][,2]
It gets more difficult when trying to combine them; you might have to convert them back to native types:
t(rbind(as.character(factorizedss[indices[,1],][,1]),as.numeric(factorizedss[indices[,2],][,2])))
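Note that the expressions above index rows of factorizedss, which happens to give the expected output for this example. If the integers are meant as positions among the factor levels, a lookup through levels() may be closer to the intent. A minimal sketch, assuming made-up column names (`x`, `y`, `i`, `j`) and a hypothetical result name `lookup`:

```r
ss <- data.frame(x = c("a", "b", "a"), y = c(1, 2, 1))
factorizedss <- data.frame(lapply(ss, as.factor))
indices <- data.frame(i = c(1, 1, 2, 2), j = c(1, 1, 1, 2))

# Map each integer to the level at that position, column by column;
# levels are stored as strings, so the numeric column is converted back
lookup <- data.frame(
  x = levels(factorizedss$x)[indices$i],
  y = as.numeric(levels(factorizedss$y))[indices$j]
)
```

Building a data frame instead of using rbind keeps each column in its own type, whereas a matrix would coerce everything to character.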
I'm working with a large data set at the moment. I am looking at the water RightID # and trying to separate all duplicates from single rights. Duplicate rights are to be dealt with in a different manner than the single ones. I am using the dplyr package and have the following script written out so far.
# Change data to a tibble
tbl.all.rights <- tbl_df(rights$RightID)
# filter through duplicate rightIDs
# creates a new data frame with 1 for duplicate and 0 for non duplicate.
log.dup <- data.frame(as.numeric(duplicated(tbl.all.rights)))
log.dup$RightID <- tbl.all.rights$value
However, the duplicated function returns a FALSE value for the first duplicate because of the order in which the function goes through the vector.
> e.g.) Duplicate RightId
> 0 1000
> 0 999
> 1 999
> 1 999
I would like to preserve duplicate rights in their own data base. I was considering writing my own function to capture that first duplicate, and use that in conjunction with sapply. However, I'm having trouble writing that function. Any guidance would be appreciated
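A custom function may not be necessary here: duplicated() accepts a fromLast argument, and combining a forward and a backward pass flags every member of a duplicated group, including the first one. A minimal sketch on a made-up vector (the names `right_id`, `is_dup`, `dup_rights` and `single_rights` are hypothetical):

```r
right_id <- c(1000, 999, 999, 999)

# Forward pass misses the first copy, backward pass misses the last copy;
# OR-ing them marks every value that occurs more than once
is_dup <- duplicated(right_id) | duplicated(right_id, fromLast = TRUE)

dup_rights    <- right_id[is_dup]   # all rights that appear more than once
single_rights <- right_id[!is_dup]  # rights that appear exactly once
```

The same logical vector can be used to split the rows of the full rights data frame instead of a bare vector.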
I have a data set that has a number of columns, but to keep it short here's an abbreviated form (the data is from the Divvy competition)
Trip ID Tripduration from_id to_id
1 50 2 2
2 700 2 5
3 80 2 4
When I imported the data from the .csv R made it into a data.frame, which is OK. So using
full.set2 <- sapply(full.set, function(x) {
  if (is.factor(x)) {
    as.numeric(x)
  } else {
    x
  }
})
I was able to convert the entire thing into a "Large Matrix" (according to RStudio). So Now I'm trying to clear out the values that meet 2 criteria:
1) Tripduration <= 90
&&
2) from_id == to_id
When I do
full.set2t<-full.set2[full.set2[,2]>=90]
It makes full.set2t into one very large vector rather than keeping it as a matrix (though it does look like it might be removing the proper values, as the number of elements decreased).
I've also tried subset on the original data.frame but I got the error that "> not meaningful for factors"
Any ideas? I've searched around and can't seem to get any of the other solutions I've found to work.
EDIT: As I'm continuing searching I'll put here other things I've tried that didn't work:
x<-seq(1:90)
x<-as.numeric(x)
y<- full.set[! full.set$tripduration %in% x,]
## Does something, removes some data points but not all of the proper ones
Solution found!
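full.set$tripduration<-as.numeric(as.character(full.set$tripduration))
full.set.test<-full.set[full.set$tripduration>90,]
Turns out that the column was a factor and not numeric, and I didn't know how to convert that single column. Note that the conversion has to go through as.character, since as.numeric on a factor returns the internal level codes rather than the original values, and the row subset needs the trailing comma.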
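A small illustration of why converting a factor column needs care, using made-up durations (the names `f`, `codes` and `values` are hypothetical):

```r
# A factor as read.csv might produce it when a numeric column is read as text
f <- factor(c("50", "700", "80"))

codes  <- as.numeric(f)                # level codes, not the durations
values <- as.numeric(as.character(f))  # the durations themselves
```

Comparisons such as `> 90` only make sense on `values`; on `codes` they silently filter on the position of each level.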
The problem is this line
full.set2t<-full.set2[full.set2[,2]>=90]
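To subset a data.frame (or a matrix, which is what full.set2 is after sapply) by rows you need to use [rows,columns], where leaving one blank means select everything. Without the comma, the logical vector indexes the matrix as one long vector, which is why the result collapsed to a vector. So the line should be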
full.set2t<-full.set2[full.set2[,2]>=90,] # note the comma
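To apply both criteria from the question at once, the two conditions can be combined with the vectorised `&` (not `&&`) and negated. A minimal sketch on a toy matrix mirroring the abbreviated data, with assumed column names and a hypothetical helper `drop_row`:

```r
full.set2 <- cbind(trip_id      = 1:3,
                   tripduration = c(50, 700, 80),
                   from_id      = c(2, 2, 2),
                   to_id        = c(2, 5, 4))

# Rows to clear out: short trips that start and end at the same station
drop_row <- full.set2[, "tripduration"] <= 90 &
            full.set2[, "from_id"] == full.set2[, "to_id"]

# Keep everything else; drop = FALSE keeps a matrix even if one row remains
full.set2t <- full.set2[!drop_row, , drop = FALSE]
```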
I am trying to build a database in R from multiple csvs. There are NAs spread throughout each csv, and I want to build a master list that summarizes all of the csvs in a single database. Here is some quick code that illustrates my problem (most csvs actually have 1000s of entries, and I would like to automate this process):
d1=data.frame(common=letters[1:5],species=paste(LETTERS[1:5],letters[1:5],sep='.'))
d1$species[1]=NA
d1$common[2]=NA
d2=data.frame(common=letters[1:5],id=1:5)
d2$id[3]=NA
d3=data.frame(species=paste(LETTERS[1:5],letters[1:5],sep='.'),id=1:5)
I have been going around in circles (writing loops), trying to use merge and reshape(melt/cast) without much luck, in an effort to succinctly summarize the information available. This seems very basic but I can't figure out a good way to do it. Thanks in advance.
To be clear, I am aiming for a final database like this:
common species id
1 a A.a 1
2 b B.b 2
3 c C.c 3
4 d D.d 4
5 e E.e 5
I recently had a similar situation. The code below goes through all the variables and, for each one, fills in as much of the missing information as it can and adds it back to the dataset. Once every variable has been processed, running it one last time on the first variable gives the final result.
#combine all into one dataframe
require(gtools)
d <- smartbind(d1,d2,d3)
#function to get the first non NA result
getfirstnonna <- function(x){
  ret <- head(x[which(!is.na(x))],1)
  # head() on an all-NA input gives a zero-length vector, not NULL
  ret <- if (length(ret) == 0) NA else ret
  return(ret)
}
#function to get max info based on one variable
runiteration <- function(dataset,variable){
require(plyr)
e <- ddply(.data=dataset,.variables=variable,.fun=function(x){apply(X=x,MARGIN=2,FUN=getfirstnonna)})
#returns the above without the NA "factor"
return(e[which(!is.na(e[ ,variable])), ])
}
#run through all variables
for(i in 1:length(names(d))){
d <- rbind(d,runiteration(d,names(d)[i]))
}
#repeat first variable since all possible info should be available in dataset
d <- runiteration(d,names(d)[1])
If id, species, etc. differ in separate datasets, then this will return whichever non-NA data is on top. In that case, changing the row order in d, and changing the variable order could affect the result. Changing the getfirstnonna function will alter this behavior (tail would pick last, maybe even get all possibilities). You could order the dataset by the most complete records to the least.
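As an illustration of the tail variant mentioned above, here is a minimal sketch of a drop-in replacement that keeps the last non-NA value instead of the first (the name `getlastnonna` is hypothetical):

```r
# Return the last non-NA element of x, or NA if there is none
getlastnonna <- function(x) {
  ret <- tail(x[!is.na(x)], 1)
  if (length(ret) == 0) NA else ret
}
```

Passing it as FUN in place of getfirstnonna would make later rows win when the datasets disagree.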