Counting non-missing occurrences - R

I need help counting the number of non-missing data points across files and subsetting out only two columns of the larger data frame.
I was able to limit the data to only valid responses, but then I struggled to get it to return only two of the columns.
I found http://www.statmethods.net/management/subset.html and tried their solution, but myvars did not hold my column label; it returned the vector of data (1:10). My code was:
myvars <- c("key")
answer <- data_subset[myvars]
answer
But instead of printing out my data subset with only the "key" column, it returns the following errors:
"Error in [.data.frame(observations_subset, myvars) : undefined columns selected" and "Error: object 'answer' not found
Lastly, I'm not sure how to count occurrences. In Excel there is a simple "Count" function, and in SPSS you can aggregate based on the count, but I couldn't find a similarly named command in R. The incredibly long way I was going to go about this, once I had the data subsetted, was adding in a column of nothing but 1's and summing those, but I would imagine there is an easier way.

To count unique occurrences, use table.
For example:
# load the "iris" data set that's built into R
data(iris)
# print the count of each species
table(iris$Species)
Take note of the handy function prop.table for converting a table into proportions, and of the fact that table can actually take a second argument to get a cross-tab. There's also an argument useNA, to include missing values as unique items (instead of ignoring them).
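For instance, a quick sketch of those variations (reusing the iris data, plus a small vector with a missing value):
# proportions instead of raw counts
prop.table(table(iris$Species))
# a second argument gives a cross-tab of two variables
table(iris$Species, iris$Petal.Width > 1)
# count NA as its own category instead of ignoring it
x <- c("a", "b", NA, "a")
table(x, useNA = "ifany")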

Not sure whether this is what you wanted.
First, create some data, since the post mentions working with multiple files:
set.seed(42)
d1 <- as.data.frame(matrix(sample(c(NA,0:5), 5*10, replace=TRUE), ncol=10))
set.seed(49)
d2 <- as.data.frame(matrix(sample(c(NA,0:8), 5*10, replace=TRUE), ncol=10))
Create a list with the datasets as its elements:
l1 <- mget(ls(pattern="d\\d+"))
Create an index to select the list element that has the maximum number of non-missing elements:
indx <- which.max(sapply(l1, function(x) sum(!is.na(x))))
The key of columns to subset from the larger (most complete) dataset:
key <- c("V2", "V3")
Subset the dataset
l1[[indx]][key]
# V2 V3
#1 1 1
#2 1 3
#3 0 0
#4 4 5
#5 7 8
names(l1[indx])
#[1] "d2"

Related

How do I replace specific cell values in dataframe using continuous (sequential) indexing?

I have two dataframes of equal dimensions.
One has a value in certain cells (e.g. 'abc') that I need to index. The other has all different values, and I need to replace the values in the other dataframe at the same indexes as 'abc'.
Examples:
df1 <- data.frame('1'=c('abc','bbb','rweq','dsaf','cxc','rwer','anc','ewr','yuje','gda'),
'2'=c(NA,NA,'bbb','dsaf','rwer','dsaf','ewr','cxc','dsaf','cxc'),
'3'=c(NA,NA,'dsaf','abc','bbb','cxc','yuje',NA,'ewr','anc'),
'4'=c(NA,NA,'cxc',NA,'abc','anc',NA,NA,'yuje','rweq'),
'5'=c(NA,NA,'anc',NA,'abc',NA,NA,NA,'rwer','rwer'),
'6'=c(NA,NA,'rweq',NA,'dsaf',NA,NA,NA,'bbb','bbb'),
'7'=c(NA,NA,'abc',NA,'ewr',NA,NA,NA,'abc','abc'),
'8'=c(NA,NA,'abc',NA,'rweq',NA,NA,NA,'cxc','bbb'),
'9'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'anc',NA),
'10'=c(NA,NA,NA,NA,'abc',NA,NA,NA,'rweq',NA))
df2 <- data.frame('1'=c('green','black','white','yelp','help','green','red','brown','green','crack'),
'2'=c(NA,NA,'black','yelp','green','yelp','brown','help','yelp','help'),
'3'=c(NA,NA,'yelp','green','black','help','green',NA,'brown','red'),
'4'=c(NA,NA,'help',NA,'green','red',NA,NA,'green','white'),
'5'=c(NA,NA,'red',NA,'green',NA,NA,NA,'green','green'),
'6'=c(NA,NA,'white',NA,'yelp',NA,NA,NA,'black','black'),
'7'=c(NA,NA,'green',NA,'brown',NA,NA,NA,'green','green'),
'8'=c(NA,NA,'green',NA,'white',NA,NA,NA,'help','black'),
'9'=c(NA,NA,NA,NA,'green',NA,NA,NA,'red',NA),
'10'=c(NA,NA,NA,NA,'green',NA,NA,NA,'white',NA))
I can find the sequential indexes of 'abc', but it returns a plain one-dimensional vector:
which(df1 == 'abc')
#[1] 1 24 35 45 63 69 70 73 85 95
And I don't know how to replace values using this method.
In the output I expect to see df2 with its 'green' values replaced only at the indexes where df1 has 'abc'.
But note that the 'green' values in df2 do not occur only at those indexes, so I can't simply match on 'green' in df2.
I don't think your problem is appropriately approached with the data in a data.frame. That introduces several complications. First, each variable (column) in the data frame is a factor with different levels! Second, your code is making a comparison between a list (the data.frame) and a factor (which is coerced into an atomic vector). The help page for the == operator states that if one operand is a list, R attempts to coerce it to the type of the atomic vector. It also points out that factors get special handling in comparisons, where R first assumes you are comparing factor levels, which is what your code is doing.
I think you want to convert your data frames of identical dimensions to matrices first. If you need the results in a data.frame, convert back afterwards as I show here, but realize that the factor levels may have changed.
# Starting with the values assigned to df1 and df2
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)
index <- which(m1 == "abc")
m2[index] <- "abc"
df2 <- as.data.frame(m2)
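One caveat, assuming an R version earlier than 4.0: there, data.frame() and as.data.frame() default to stringsAsFactors = TRUE, which is what creates the factor columns in the first place. Passing stringsAsFactors = FALSE keeps plain character columns and sidesteps the level problem:
df2 <- as.data.frame(m2, stringsAsFactors = FALSE)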
Here is a way to do it. Learn about the *apply family in R: I think it is the most useful group of functions in the language, whatever you plan to do ;) Also know that a data.frame is of type 'list'.
df1[] <- lapply(df1, function(frame, pattern, replace){ # for each frame = column:
  matches <- which(frame %in% pattern)  # the indexes of the column that match the pattern
  if(length(matches) > 0)               # if there is at least one matching index,
    frame[matches] <- replace           # give it the value you want
  return(frame)                         # commit your changes back to df1
}, pattern="abc", replace="<whatYouWant>") # don't forget this part: the needed arguments!
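Since the question wants the replacement applied to df2 at the positions where df1 has 'abc', the same column-wise idea can be adapted with Map() over both data frames. A sketch (the as.character() call avoids the factor-level issue discussed above):
df2[] <- Map(function(a, b) {
  b <- as.character(b)       # avoid factor-level problems when assigning
  b[a %in% "abc"] <- "abc"   # %in% treats NA as FALSE, unlike ==
  b
}, df1, df2)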

Write script to ignore objects which can't be found in R

I am trying to construct a script in R that ignores objects it can't find.
A simplified version of my script is as follows:
Trial <- sum(a, b, c, d, e)
a-e are numeric vectors generated by calculating the sum of a column in a data frame.
My problem is that I want to use the same script over multiple different conditions (and I have far more objects than just a-e). For some of these conditions, some of the objects a-e may not exist, so R returns the error object 'd' not found.
To avoid having to generate a unique script for each condition, I would like to force R to ignore any missing objects.
I would be grateful for any help!
Welcome to SO! As mentioned in the comments, in the future try to include a working example in your question. The preferred solution to your problem would be to avoid assigning values to individual variables in the first place. Try to restructure your code so that your column sums get assigned to, for example, a vector or a list. In the example below, I create some sample data, assign the column sum values to a vector, and compute the sum of the vector, without creating a new variable for each column.
# Create sample data
rData <- as.data.frame(matrix(c(1:6), nrow=6, ncol=5, byrow = TRUE))
print(rData)
# Compute column sum
sumVec <- apply(rData, 2, sum)
print(sumVec)
# Compute sum of column sums
total <- sum(sumVec)
print(total)
If you have to use individual variables, before adding them up, you could check if the variable exists, and if not, create it and assign NA. You can then compute the sum of your variables after excluding NA.
# Sample variables
a <- 15
b <- 20
c <- 50
# Assign NA if it doesn't exist (one variable at a time)
if(!exists("d")) { d <- NA }
# Assign NA using sapply (preferred)
sapply(c("a","b","c","d","e"), function(x)
  if(!exists(x)) { assign(x, NA, envir=.GlobalEnv) }
)
# Compute sum after excluding NA
altTotal <- sum(na.omit(c(a,b,c,d,e)))
print(altTotal)
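Another base R option is mget(), whose ifnotfound argument supplies a value for any name that doesn't exist; a sketch with the question's variable names:
# returns a list with NA for the missing variables d and e
vars <- mget(c("a","b","c","d","e"), envir = .GlobalEnv, ifnotfound = NA)
Trial <- sum(unlist(vars), na.rm = TRUE)
print(Trial)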
Hopefully this will get you closer to the solution!

Vector gets stored as a dataframe instead of being a vector

I am new to R and RStudio and I need to create a vector that stores the first 100 rows of the csv file the programme reads. However, despite all my attempts, my variable v1 ends up becoming a dataframe instead of an int vector. May I know what I can do to solve this? Here's my code:
library(readr)
cup_data <- read_csv("C:/Users/Asus.DESKTOP-BTB81TA/Desktop/STUDY/YEAR 2/
YEAR 2 SEM 2/PREDICTIVE ANALYTICS(1_PA_011763)/Week 1 (Intro to PA)/
Practical/cup98lrn variable subset small.csv")
# Retrieve only the selected columns
cup_data_small <- cup_data[c("AGE", "RAMNTALL", "NGIFTALL", "LASTGIFT",
"GENDER", "TIMELAG", "AVGGIFT", "TARGET_B", "TARGET_D")]
str(cup_data_small)
cup_data_small
#get the number of columns and rows
ncol(cup_data_small)
nrow(cup_data_small)
cat("No of column",ncol(cup_data_small),"\nNo of Row :",nrow(cup_data_small))
#cat
#Concatenate and print
#Outputs the objects, concatenating the representations.
#cat performs much less conversion than print.
#Print the first 10 rows of cup_data_small
head(cup_data_small, n=10)
#Create a vector V1 by selecting first 100 rows of AGE
v1 <- cup_data_small[1:100,"AGE",]
cup_data_small is a tibble, a slightly modified version of a dataframe that has slightly different rules to try to avoid some common quirks/inconsistencies in standard dataframes. E.g. in a standard dataframe, df[, c("a")] gives you a vector, and df[, c("a", "b")] gives you a dataframe - you're using the same syntax so arguably they should give the same type of result.
To get just a vector from a tibble, you have to explicitly pass drop = TRUE, e.g.:
library(dplyr)
# Standard dataframe
iris[, "Species"]
iris_tibble = iris %>%
as_tibble()
# Remains a tibble/dataframe
iris_tibble[, "Species"]
# This gives you just the vector
iris_tibble[, "Species", drop = TRUE]

R - Summation of data frame columns changes data type

I have a data frame of 15 columns, where the first column is an integer and the others are numeric. I have to generate a one-line summary with the sum of all columns except the last one, for which I need the mean. So I am doing something like this:
summary <- c(sum(df$col1), ... mean(df$col15))
The summary then appears with values up to two decimal places, even for the integer column (the first one). I have been trying the round function to fix this. I can understand it when different types are added, e.g. 1 + 1.0. But in this case, shouldn't the summation maintain the data type?
What am I missing?
If you are looking for a one-line summary:
lst <- c(lapply(df[-ncol(df)], function(x) sum(x)), mean=mean(df[,ncol(df)]))
as.data.frame(lst)
# int num1 mean
#1 10 6 2.5
The output is a data frame that preserves the classes of each vector. If you would like the output to be added to the original data frame you can replace as.data.frame(lst) with:
names(lst) <- names(df)
rbind(df, lst)
If you are trying to get the sum of all integer columns and the mean of numeric columns, go with #Frank's answer.
Data
df <- data.frame(int=1:4, num1=seq(1,2,length.out=4), num2=seq(2,3,length.out=4))
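As to why the question sees decimals at all: c() coerces its elements to the most general type, so an integer sum combined with a numeric mean is promoted to double. A quick demonstration with the data above:
class(sum(df$int))                    # "integer"
class(c(sum(df$int), mean(df$num1)))  # "numeric" - the integer was promoted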
Perhaps an adaptation of this?
apply(iris[,1:4], 2, sum) / c(rep(1,3), nrow(iris))

Efficient method to drop rows with NA values in R

Background
Before running a stepwise model selection, I need to remove missing values for any of my model terms. With quite a few terms in my model, there are therefore quite a few vectors that I need to look in for NA values (and drop any rows that have NA values in any of those vectors). However, there are also vectors that contain NA values that I do not want to use as terms / criteria for dropping rows.
Question
How do I drop rows from a dataframe which contain NA values for any of a list of vectors? I'm currently using the clunky method of a long series of !is.na's
my.df[!is.na(my.df$termA) & !is.na(my.df$termB) & !is.na(my.df$termD), ]
but I'm sure that there is a more elegant method.
Let dat be a data frame and cols a vector of column names or column numbers of interest. Then you can use
dat[!rowSums(is.na(dat[cols])), ]
to exclude all rows with at least one NA.
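Base R also has complete.cases(), which performs the same row-wise check directly; with the same dat and cols as above:
dat[complete.cases(dat[cols]), ]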
Edit: I completely glossed over subset, the built-in function that is made for subsetting:
my.df <- subset(my.df,
                !(is.na(termA) |
                  is.na(termB) |
                  is.na(termC)))
I tend to use with() for things like this. Don't use attach, you're bound to cut yourself.
my.df <- my.df[with(my.df, {
  !(is.na(termA) |
    is.na(termB) |
    is.na(termC))
}), ]
But if you often do this, you might also want a helper function like is_any(), which flags rows where any of the given vectors is NA:
is_any <- function(...) {
  Reduce(`|`, lapply(list(...), is.na))
}
my.df <- my.df[!with(my.df, is_any(termA, termB, termC)), ]
If you end up doing a lot of this sort of thing, using SQL is often going to be a nicer interaction with subsets of data. dplyr may also prove useful.
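As a sketch of that dplyr/tidyr route (assuming both packages are installed):
library(dplyr)
library(tidyr)
# drop_na() drops rows with NA in any of the listed columns
my.df <- my.df %>% drop_na(termA, termB, termC)
# the equivalent filter() spelling
my.df <- my.df %>% filter(!is.na(termA), !is.na(termB), !is.na(termC))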
This is one way:
# create some random data
df <- data.frame(y=rnorm(100),x1=rnorm(100), x2=rnorm(100),x3=rnorm(100))
# introduce random NA's
df[round(runif(10,1,100)),]$x1 <- NA
df[round(runif(10,1,100)),]$x2 <- NA
df[round(runif(10,1,100)),]$x3 <- NA
# this does the actual work...
# assumes data is in columns 2:4, but can be anywhere
for (i in 2:4) {df <- df[!is.na(df[,i]),]}
And here's another, using sapply(...) and Reduce(...):
xx <- data.frame(!sapply(df[2:4],is.na))
yy <- Reduce("&",xx)
zz <- df[yy,]
The first statement "applies" the function is.na(...) to columns 2:4 of df and inverts the result (we want !NA). The second statement applies the logical & operator to the columns of xx in succession. The third statement extracts only the rows where yy is TRUE. Clearly this can all be combined into one horrifically complicated statement:
zz <- df[Reduce("&", data.frame(!sapply(df[2:4], is.na))), ]
Using sapply(...) and Reduce(...) can be faster if you have very many columns.
Finally, most modeling functions have parameters that can be set to deal with NA's directly (without resorting to all this). See, for example the na.action parameter in lm(...).
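For instance, a minimal sketch with the random data generated above (na.omit drops incomplete rows before fitting and is also the usual default):
fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)
summary(fit)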
