I have a variable, df1$StudyAreaVisitNote, which I turn into a factor. But when I subset df1 into BS, this variable does not seem to remain a factor: running table() on the subsetted data shows results that look like what table() would return on the original data.
Why does this happen?
The two workarounds I found were:
export the subsetted data and re-import
after subsetting, designate the column as a factor again
Code:
# My dataset can be found here: http://textuploader.com/9tx5 (I'm sure there's a better way to host it, but I'm new, sorry!)
# Load Initial Dataset (df1)
df1 <- read.csv("/Users/user/Desktop/untitled folder/pre_subset.csv", header=TRUE,sep=",")
# Make both columns factors
df1$Trap.Type <- factor(df1$Trap.Type)
df1$StudyAreaVisitNote <- factor(df1$StudyAreaVisitNote)
# Subset out site of interest
BS <- subset(df1, Trap.Type=="HR-BA-BS")
# Export to Excel, save as CSV after it's in excel
library(WriteXLS)
WriteXLS("BS", ExcelFileName = "/Users/user/Desktop/test.xlsx", col.names = TRUE, AdjWidth = TRUE, BoldHeaderRow = TRUE, FreezeRow = 1)
# Load second Dataset (df2)
df2 <- read.csv("/Users/user/Desktop/untitled folder/post_subset.csv", header=TRUE, sep=",")
# both datasets should be identical, and they are superficially, but...
# Have a look at df2
summary(df2$StudyAreaVisitNote) # Looks good, only counts levels that are present
# Now, look at BS from df1
summary(BS$StudyAreaVisitNote) # sessions not present in the subsetted data (but present in df1?) are included???
# Make BS$StudyAreaVisitNote a factor...Again??
BS$StudyAreaVisitNote <- factor(BS$StudyAreaVisitNote)
# Re-run the summary from above
summary(BS$StudyAreaVisitNote) # this time it works, why is factor not carried through during subset?
A factor remains a factor even after subsetting; I'm sure class(BS$StudyAreaVisitNote) == "factor" is still TRUE. But factors don't automatically drop their unused levels. This can be helpful when you are doing something like:
set.seed(16)
dd <- data.frame(
  gender = sample(c("M", "F"), 25, replace = TRUE),
  age = rpois(25, 20),
  stringsAsFactors = TRUE  # needed in R >= 4.0 so that gender is a factor
)
dd
table(subset(dd, age<15)$gender)
# F M
# 0 3
Here the factor remembers that it had M's and F's, and even though the subset doesn't contain any F's, the level is still retained. You may explicitly call droplevels() if you want to get rid of unused levels.
table(droplevels(subset(dd, age<15))$gender)
# M
# 3
(now it forgot about the F's)
So instead of summary, compare the results of table on your two data.frames.
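Applied to your example (a sketch, assuming BS was created by the subset() call above), dropping the unused levels gives the same result as your re-factor() workaround:
# Drop the unused StudyAreaVisitNote levels from the subsetted data
BS$StudyAreaVisitNote <- droplevels(BS$StudyAreaVisitNote)
summary(BS$StudyAreaVisitNote)  # now only sessions present in BS are counted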
I have some large data frames that are big enough to push the limits of R on my machine; e.g., the one on which I'm currently working is 2 columns by 70 million rows. The contents aren't important, but just in case, column 1 is a string and column 2 is an integer.
What I would like to do is split that data frame into n parts (say, 20, but preferably something that could change on a case-by-case basis) so that I can work on each of the smaller data frames one at a time. That means that (a) the result has to produce things that are named (e.g., "newdf_1", "newdf_2", ... "newdf_20" or something), and (b) each line in the original data frame needs to be in one (and only one) of the new "sub" data frames. The order does not matter, but doing it sequentially by rows makes sense to me.
Once I do the work, I will start to recombine them (using rbind()) one pair at a time.
I've looked at split(), but from what I can tell, it is designed to work with factors (which I don't have).
Any ideas?
You can create a new column and split the data frame based on that column. The column does not need to be a factor, but it needs to be of a type that split() can convert to a factor.
# Number of groups
N <- 20
dat$group <- 1:nrow(dat) %% N
# Shift the group values from 0..(N-1) to 1..N
dat$group <- dat$group + 1
# Split dat by group (the formula interface of split() requires R >= 4.1)
dat_list <- split(dat, f = ~group)
# Set the name of the list
names(dat_list) <- paste0("newdf_", 1:N)
Data
set.seed(123)
# Create example data frame
dat <- data.frame(
A = sample(letters, size = 70000000, replace = TRUE),
B = rpois(70000000, lambda = 1)
)
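To put the pieces back together after processing them, one option (a sketch; do.call() binds all of the list elements at once rather than one pair at a time) is:
# Recombine all sub data frames into one; row order follows the list order
dat_recombined <- do.call(rbind, dat_list)
# Drop the helper column if it is no longer needed
dat_recombined$group <- NULL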
Here's a tidyverse-based solution. Try using read_csv_chunked() from readr.
library(tidyverse)
# practice data: write a million-row CSV to disk
tibble(string = sample(letters, 1e6, replace = TRUE),
       value = rnorm(1e6)) %>%
  write_csv("test.csv")
# the solution: keep only rows where string == "a", reading the file in chunks
partial_data <- read_csv_chunked("test.csv",
                                 DataFrameCallback$new(function(x, pos) filter(x, string == "a")),
                                 chunk_size = 1000)
You can wrap the call to read_csv_chunked in a function where you change the string that you subset on.
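For example (a sketch; read_subset and target are made-up names, not part of readr):
# Hypothetical wrapper: keep only rows whose `string` column equals `target`
read_subset <- function(file, target, chunk_size = 10000) {
  read_csv_chunked(file,
                   DataFrameCallback$new(function(x, pos) filter(x, string == target)),
                   chunk_size = chunk_size)
}
partial_a <- read_subset("test.csv", target = "a")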
This is more or less a repeat of this question:
How to read only lines that fulfil a condition from a csv into R?
I am using NHANES demographic data from 2011-2016. I've managed to download all three datasets, but I can't seem to merge them because NHANES 2011-2012 has 48 variables while the other two have 47. I've tried dropping the extra variable, but I need the number of people aged 18+ included in my data. How else can I combine the datasets if the numbers of variables don't match? I've tried rbind, cbind, merge, and various other things, but I just can't figure out what I'm doing wrong.
See code below:
library(haven)
nhanes = read_xpt('https://wwwn.cdc.gov/Nchs/Nhanes/2011-2012/DEMO_G.XPT')
nhanes2 = read_xpt('https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.XPT')
nhanes3 = read_xpt('https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT')
totalnhanes <- rbind(nhanes,nhanes2,nhanes3)
Add the missing variables and set all values to NA:
setdiff(names(nhanes), names(nhanes2))
#[1] "RIDEXAGY"
nhanes2$RIDEXAGY <- NA
setdiff(names(nhanes), names(nhanes3))
#[1] "RIDEXAGY"
nhanes3$RIDEXAGY <- NA
totalnhanes <- rbind(nhanes,nhanes2,nhanes3) # Works. :)
You can just use bind_rows(), and any missing column (columns are matched by name) will be filled with NA.
library(dplyr)
df <- bind_rows(nhanes, nhanes2, nhanes3)
rbindlist() from data.table can also serve the purpose.
library(haven)
library(data.table)
nhanes = read_xpt('https://wwwn.cdc.gov/Nchs/Nhanes/2011-2012/DEMO_G.XPT')
nhanes2 = read_xpt('https://wwwn.cdc.gov/Nchs/Nhanes/2013-2014/DEMO_H.XPT')
nhanes3 = read_xpt('https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT')
l <- list(nhanes,nhanes2,nhanes3)
totalnhanes <- rbindlist(l, fill = TRUE)
I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled getting it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not hold my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
"Error in [.data.frame(observations_subset, myvars) : undefined columns selected" and "Error: object 'answer' not found
Lastly, I'm not sure how to count occurrences. Excel has a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a similarly named command in R. The incredibly long way I was going to go about this, once I had the data subsetted, was adding a column of nothing but 1's and summing it, but I imagine there is an easier way.
To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
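For instance (a sketch on the same iris data; the group variable is made up just to show a cross-tab):
# Proportions instead of counts
prop.table(table(iris$Species))
# Cross-tab: Species counts by a second (made-up) grouping variable
group <- sample(c("A", "B"), nrow(iris), replace = TRUE)
table(iris$Species, group)
# Count NA as its own category instead of ignoring it
x <- c("a", "b", NA, "a")
table(x, useNA = "ifany")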
Not sure whether this is what you wanted.
Creating some example data, since the post mentions multiple files:
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with datasets as the list elements
l1 <- mget(ls(pattern="d\\d+"))
Create an index to select the list element that has the most non-missing values:
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
The columns to subset from the dataset with the most non-missing values:
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"
In the midst of merging several data sets, I'm trying to remove all rows of a data frame that have a missing value for one particular variable (I want to keep the NAs in some of the other columns for the time being). I used the following line:
data.frame <- data.frame[!is.na(data.frame$year),]
This successfully removes all rows with NA for year (and no others), but the other columns, which previously had data, are now entirely NA. In other words, non-missing values are being converted to NA. Any ideas as to what's going on here? I've tried these alternatives and got the same outcome:
data.frame <- subset(data.frame, !is.na(year))
data.frame$x <- ifelse(is.na(data.frame$year) == T, 1, 0);
data.frame <- subset(data.frame, x == 0)
Am I using is.na incorrectly? Are there any alternatives to is.na in this scenario? Any help would be greatly appreciated!
Edit: Here is code that should reproduce the issue:
#data
tc <- read.csv("http://dl.dropbox.com/u/4115584/tc2008.csv")
frame <- read.csv("http://dl.dropbox.com/u/4115584/frame.csv")
#standardize NA codes
tc[tc == "."] <- NA
tc[tc == -9] <- NA
#standardize spatial units
colnames(frame)[1] <- "loser"
colnames(frame)[2] <- "gainer"
frame$dyad <- paste(frame$loser,frame$gainer,sep="")
tc$dyad <- paste(tc$loser,tc$gainer,sep="")
drops <- c("loser","gainer")
tc <- tc[,!names(tc) %in% drops]
frame <- frame[,!names(frame) %in% drops]
rm(drops)
#merge tc into frame
data <- merge(tc, frame, by.x = "year", by.y = "dyad", all.x=T, all.y=T) #year column is duplicated in this process. I haven't had this problem with nearly identical code using other data.
rm(tc,frame)
#the first column in the new data frame is the duplicate year, which does not actually contain years. I'll rename it.
colnames(data)[1] <- "double"
summary(data$year) #shows 833 NA's
summary(data$procedur) #note that at this point there are non-NA values
#later, I want to create 20 year windows following the events in the tc data. For simplicity, I want to remove cases with NA in the year column.
new.data <- data[!is.na(data$year),]
#now let's see what the above operation did
summary(new.data$year) #missing years were successfully removed
summary(new.data$procedur) #this variable is now entirely NA's
I think the actual problem is with your merge.
After you merge and have the data in data, if you do:
table(data$procedur, useNA = "always")
#      1      2      3      4      5      6   <NA>
#    122    112    356     59     39     19 192258
You see there are this many (122 + 112 + ... + 19) non-NA values of data$procedur. But all of these values occur in rows where data$year is NA.
all(is.na(data$year[!is.na(data$procedur)]))
# [1] TRUE   # every non-NA value of procedur occurs where year is NA
So, basically, all the values of procedur are removed as well, because you dropped exactly those rows when filtering out NA in year.
To solve this problem, I think you should use merge as:
merge(tc, frame, all = TRUE)  # it will automatically merge on the common columns
# and this will not produce a duplicated year column
Check if this merge gives you the desired result.
Try complete.cases:
data.frame.clean <- data.frame[complete.cases(data.frame$year),]
...though, as noted above, you may want to pick a more descriptive name.
I have a data frame with 251 observations and 45 variables. There are 6 observations in the middle of the data frame that I'd like to exclude from my analyses. All 6 belong to one level of a factor. It is easy to generate a new data frame that, when printed, appears to exclude the 6 observations. When I use the new data frame to plot variables by the factor in question, however, the supposedly excluded level is still included in the plot (sans observations). Using str() confirms that the level is still present in some form. Also, the index for the new data frame skips 6 values where the observations formerly resided.
How can I create a new data frame that excludes the 6 observations and does not continue to recognize the excluded factor level when plotting? Can the new data frame be made to "re-index", so that the new index does not skip values formerly assigned to the excluded factor level?
I've provided an example with made up data:
# ---------------------------------------------
# data
char <- c( rep("anc", 4), rep("nam", 3), rep("oom", 5), rep("apt", 3) )
a <- 1:15 / pi
b <- seq(1, 8, .5)
d <- rep(c(3, 8, 5), 5)
dat <- data.frame(char, a, b, d, stringsAsFactors = TRUE)  # needed in R >= 4.0 so that char is a factor
dat
# two ways to remove rows that contain a string
datNew1 <- dat[-which(dat$char == "nam"), ]
datNew1
datNew2 <- dat[grep("nam", dat[ ,"char"], invert=TRUE), ]
datNew2
# plots still contain the factor level that was excluded
boxplot(datNew1$a ~ datNew1$char)
boxplot(datNew2$a ~ datNew2$char)
# str confirms that it's still there
str(datNew1)
str(datNew2)
# ---------------------------------------------
You can use the drop.levels() function from the gdata package to reduce the factor levels down to the actually used ones -- apply it on your column after you created the new data.frame.
Also try a search here for [r] drop.levels.
Starting with R version 2.12.0, there is a function droplevels(), which can be applied either to individual factor columns or to an entire data frame. When applied to a data frame, it removes zero-count levels from all factor columns. So your example becomes simply:
# two ways to remove rows that contain a string
datNew1 <- droplevels( dat[-which(dat$char == "nam"), ] )
datNew2 <- droplevels( dat[grep("nam", dat[ ,"char"], invert=TRUE), ] )
I have pasted something from my code. I have an enclosure experiment in a lake; I have measurements from the enclosures and from the lake itself, but mostly I don't want to deal with the lake. My variable is called "t.level" and its levels are control, low, medium, high, and lake. This code makes it possible to use nolk$ or data = nolk to get the data without the "lake" level:
nolk <- subset(mylakedata, t.level == "control" |
                           t.level == "low" |
                           t.level == "medium" |
                           t.level == "high")
# drop unused levels from every factor column
nolk[] <- lapply(nolk, function(x) if (is.factor(x)) x[drop = TRUE] else x)
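As mentioned in the droplevels() answer above, a single call on the whole data frame does the same thing more directly (a sketch, assuming nolk was created as above):
# drops unused levels from every factor column in one step
nolk <- droplevels(nolk)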