Combine, count df columns w/o repeating other columns - r

A simple question that I am completely stumped on after consulting packages that I thought would help (plyr, reshape, unique).
Let's say I have the df below:
df <- data.frame(location=c("ny","nj","pa","ct"),
animal=c("dog","hamster","dog","pig"),
animal2=c("cat","dog","pig","dog"))
I would like to count the unique entities in specific columns and then rank occurrences. So here, I'd like to count the combined unique entities in the columns animal and animal2. If I use reshape and melt, the associated location values will repeat in the additional rows, but I don't want that because I only want to count the frequencies of the "location" variables as given in the original df.
Is there a way to rbind without repeating other columns? So in this case I would have another column called AnimalMaster and that would have all of the frequencies I need.
When I try count(df,c("animal","animal2")), it counts the joint occurrences, which is not what I want. Alternatively, I could also do this by just counting the unique strings across multiple columns without combining them. Is there a straightforward way to do this without running into the count problem?
Thank you for helping a beginner.
EDIT:
My desired output is the following:
countsdf with columns (Type, Name, Frequency, Frequency (%)), so that top row would be:
AnimalMaster | dog | 4 | 100%

Here's a suggestion with reshape2 and data.table
require(reshape2)
require(data.table)
dt <- data.table(melt(df, id.vars = 'location', value.name = 'animal'))
dt[, list(n=length(unique(location)),
percent=100*.N/dt[, length(unique(location))]),
by=animal]
# animal n percent
# 1: dog 4 100
# 2: hamster 1 25
# 3: pig 2 50
# 4: cat 1 25
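For comparison, here is a minimal base R sketch of the "count the strings across several columns without combining them" idea from the question. It assumes that every occurrence should be counted (so a row with the same animal in both columns would count twice, unlike the unique-location count above), and the column names of countsdf are taken from the desired output in the question.

```r
df <- data.frame(location = c("ny", "nj", "pa", "ct"),
                 animal   = c("dog", "hamster", "dog", "pig"),
                 animal2  = c("cat", "dog", "pig", "dog"),
                 stringsAsFactors = FALSE)

# Pool the values of both animal columns into one character vector
all_animals <- unlist(df[c("animal", "animal2")], use.names = FALSE)

# Tabulate and rank by frequency, most frequent first
freq <- sort(table(all_animals), decreasing = TRUE)

# Percent is relative to the number of rows in the original df
countsdf <- data.frame(Type      = "AnimalMaster",
                       Name      = names(freq),
                       Frequency = as.integer(freq),
                       Percent   = 100 * as.integer(freq) / nrow(df),
                       stringsAsFactors = FALSE)
countsdf
```

The location column is never touched, so nothing is repeated; only the two animal columns are pooled.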

Related

How do I write back results of a count query to a column in R?

I would like to count the instances of an Employee ID in a column and write back the results to a new column in my dataframe. So far I am able to count the instances and display the results in the RStudio console, but I'm not sure how to write the results back. Here is what I have tested successfully:
ids<-BAR$`Employee ID`
counts<-data.frame(table(ids))
counts
And here are the returned results:
1 00000018 1
2 00000179 1
3 00001045 1
4 00002729 1
5 00003095 2
6 00003100 1
Thanks!
If we need to create a column, use add_count
library(dplyr)
BAR1 <- BAR %>%
add_count(`Employee ID`)
table returns the summarised output. If we want to create a column in the original data
BAR1$n <- table(ids)[as.character(BAR$`Employee ID`)]
If you use a data.table you will be able to do this quickly, especially with larger datasets, using .N to count number of occurrences per grouping variable given in by.
# Load data.table
library(data.table)
# Convert data to a data.table
setDT(BAR)
# Count and assign counts per level of the ID column
# (backticks are needed because the column name contains a space)
BAR[, count := .N, by = `Employee ID`]
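If no extra package is wanted, base R's ave() can also write a per-group count back to every row. This is only a sketch: the small BAR data frame below is made up for illustration, and the new column name n is an assumption.

```r
# Hypothetical stand-in for the BAR data frame from the question
BAR <- data.frame(`Employee ID` = c("00000018", "00003095", "00003095", "00000179"),
                  check.names = FALSE, stringsAsFactors = FALSE)

# ave() applies FUN within each group and returns a vector aligned with the rows
BAR$n <- ave(seq_along(BAR$`Employee ID`), BAR$`Employee ID`, FUN = length)
BAR
```

Each row now carries the number of times its Employee ID appears in the column.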

Given large data.table, use binary search to find the correct row based on the first two columns and then add 1 to third column

I have a dataframe with 3 columns. First two columns are IDs (ID1 and ID2) referring to the same item and the third column is a count of how many times items with these two IDs appear. The dataframe has many rows so I want to use binary search to first find the appropriate row where both IDs match and then add 1 to the cell under the count column in that row.
I have used the which() function to find the index of the correct row and then using the index added 1 to the count column.
For example:
index <- which(DF$ID1 == x & DF$ID2 == y)
DF$Count[index] <- DF$Count[index] + 1
While this works, the which function is very inefficient. Because I have to do this within a for loop for more than a trillion times, it takes a lot of time. Also, there is only one row in the data frame with this ID combination. While the which function goes through all the rows, a function that stops once it finds the correct row should suffice. I have looked into using data.table and setkey for this purpose but do not know how to implement that for my purpose. Thank you in advance.
Indeed, you can use data.table with a key on both columns. Both setkey(DF, ID1, ID2) and setkeyv(DF, c("ID1","ID2")) work; setkeyv takes the column names as a character vector.
library(data.table)
# example data: one row per (ID1, ID2) combination, with Count starting at 0
DF <- data.frame(expand.grid(ID1=1:100, ID2=1:100), Count=0L)
# convert DF to a data.table
DF <- as.data.table(DF)
# put both ID1 and ID2 as keys, in that order (this sorts the table)
setkeyv(DF,c("ID1","ID2"))
# example x and y values
x <- 10L
y <- 18L
# binary search on the key for ID1=x and ID2=y, then add 1 to Count by reference
DF[.(x,y), Count := Count + 1L]
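If the increments all come from a known collection of (x, y) pairs, the loop may not be needed at all. The following is a minimal sketch under that assumption: the table called pairs is hypothetical and holds one row per observed pair, and the counts for every combination are computed in a single grouped pass with .N.

```r
library(data.table)

# Hypothetical input: one row for each time a pair of IDs was observed
pairs <- data.table(ID1 = c(1L, 2L, 1L, 3L, 1L),
                    ID2 = c(7L, 8L, 7L, 9L, 7L))

# Count how often each (ID1, ID2) combination occurs
counts <- pairs[, .(Count = .N), by = .(ID1, ID2)]
counts
```

Whether this applies depends on how the pairs are generated; if they only become available one at a time, the keyed update remains the relevant approach.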

splitting data frame by repeating strings [duplicate]

This question already has answers here:
Split data.frame based on levels of a factor into new data.frames
(3 answers)
Closed 2 years ago.
I have a data frame where one column will repeat the same string for a number of lines (it varies). I'd like to split the data frame based on each of the repeating names into separate data frames (the output can be a list). For example for this data frame:
dat = data.frame(names=c('dog','dog','dog','dog','cat','cat'), value=c(1,2,3,4,5,5))
The output should be
names value
dog 1
dog 2
dog 3
dog 4
and
names value
cat 5
cat 5
I should mention there are thousands of different repeating names.
You can use the split function, which will give the output in a list. I think it would be easier to have the datasets in the list as most of the operations can be performed within the list itself
split(dat, dat$names)
If you want to split a sequence such as 'dog', 'cat', 'dog' into a list with 3 elements, one per consecutive run (based on the example shown by @BondedDust), one option is
indx <- inverse.rle(within.list(rle(as.character(dat$names)),
values <- seq_along(values)))
split(dat, indx)
Or, with data.table v1.9.5 or later (the devel version at the time of writing), we can use rleid to create a grouping variable
library(data.table)#v1.9.5+
setDT(dat)[, grp:= rleid(names)]
and then use the standard data.table operations for the different groups by specifying the 'grp' as the grouping variable.
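The difference between splitting by name and splitting by consecutive run can be seen in a small base R sketch. The data below is made up so that 'dog' appears in two separate runs; the variable names are illustrative only.

```r
dat <- data.frame(names = c("dog", "dog", "cat", "cat", "dog"),
                  value = 1:5,
                  stringsAsFactors = FALSE)

# Split by name: all 'dog' rows end up together (2 list elements)
by_name <- split(dat, dat$names)

# Split by run: give each consecutive block its own id (1 1 2 2 3)
runs   <- rle(dat$names)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
by_run <- split(dat, run_id)

length(by_name)
length(by_run)
```

For the data in the question, where each name occurs in a single block, both approaches give the same groups.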

How to drop columns from data frame with less than 2 unique levels in R

I have a dataset with numeric and categorical variables with ~200,000 rows, but many variables are constants(both numeric and cat). I am trying to create a new dataset where the length(unique(data.frame$factor))<=1 variables are dropped.
Example data set and attempts so far:
Temp=c(26:30)
Feels=c("cold","cold","cold","hot","hot")
Time=c("night","night","night","night","night")
Year=c(2015,2015,2015,2015,2015)
DF=data.frame(Temp,Feels,Time,Year)
I would think a loop would work, but something isn't working in my 2 below attempts. I've tried:
for (i in unique(colnames(DF))){
Reduced_DF <- DF[,(length(unique(DF$i)))>1]
}
But I really need a vector of the colnames where length(unique(DF$columns))>1, so I tried the below instead, to no avail.
for (i in unique(DF)){
if (length(unique(DF$i)) >1)
{keepvars <- c(DF$i)}
Reduced_DF <- DF[keepvars]
}
Does anyone out there have experience with this type of subsetting/dropping of columns with less than a certain level count?
You can find out how many unique values are in each column with:
sapply(DF, function(col) length(unique(col)))
# Temp Feels Time Year
# 5 2 1 1
You can use this to subset the columns:
DF[, sapply(DF, function(col) length(unique(col))) > 1]
# Temp Feels
# 1 26 cold
# 2 27 cold
# 3 28 cold
# 4 29 hot
# 5 30 hot
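One caveat with this kind of subsetting on a plain data.frame: if only one column survives, DF[, keep] returns a vector rather than a data frame. A minimal sketch that guards against this with drop = FALSE (the names keep and Reduced_DF are illustrative):

```r
DF <- data.frame(Temp  = 26:30,
                 Feels = c("cold", "cold", "cold", "hot", "hot"),
                 Time  = rep("night", 5),
                 Year  = rep(2015, 5))

# TRUE for columns with more than one distinct value
keep <- vapply(DF, function(col) length(unique(col)) > 1, logical(1))

# drop = FALSE keeps the result a data frame even if one column remains
Reduced_DF <- DF[, keep, drop = FALSE]
names(Reduced_DF)
```

Using vapply instead of sapply is a design choice here: it guarantees a logical vector even when DF has zero columns.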
Another way with data.table
#Convert object to data.table object
library(data.table)
setDT(DF)
#Drop columns
todrop <- names(DF)[which(sapply(DF,uniqueN)<2)]
DF[, (todrop) := NULL]
One advantage to this method is that it does not make a copy of the data (which might be useful with a dataset as large as yours).
If you are using data.table 1.9.4, where uniqueN is not available, you would change to the following:
#Drop columns
todrop <- names(DF)[which(sapply(DF, function(x) length(unique(x)) < 2))]
DF[, (todrop) := NULL]
I also have another possible solution, this one in Python with pandas rather than R, for dropping the columns with categorical values in 2 lines of code: the first line builds the list of categorical columns and the second line drops them. df is our dataframe.
list=pd.DataFrame(df.categorical).columns
df= df.drop(list,axis=1)

How to combine two columns of factors into one column without changing the factor levels into number [duplicate]

This question already has answers here:
Joining factor levels of two columns
(3 answers)
Closed 4 years ago.
I am trying to find a way to combine two columns of factors into one column without changing the factor levels into numbers. For instance, consider the following two data.frame datasets
dataset 1 dataset 2
Number Student Number Student
1 Chris 1 Matt
2 Sarah 2 Keith
I am trying to take "student" column from the dataset1 and the "student" column from the dataset2, and make one big student column containing the names "Chris", "Sarah", "Matt", and "Keith"
I tried:
student.list<-c(dataset1[,2],dataset2[,2])
student.list
However, this doesn't work since the names turn into numbers with the c() function. I want my list to preserve the names of the students (i.e. without converting them into numbers). I also tried cbind(), but it gives the same problem as c().
Thank you
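Factors are numbers that happen to have labels. When you combine factors, you generally are combining their numeric values. This can often trip a person up. (Since R 4.1.0, c() applied to factors returns a factor with the combined levels, so the problem described here affects older versions of R.)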
If you want their labels, you must coerce them to strings, using as.character
student.list <- c( as.character(dataset1[,2]) ,
as.character(dataset2[,2]) )
If you want to get that back to factors, wrap it all in as.factor (can be all in one line, or split into two lines for easier reading)
student.list <- c(as.character(dataset1[,2]),as.character(dataset2[,2]))
student.list <- as.factor(student.list)
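To see why the labels matter, here is a small sketch that looks at the integer codes behind two factors and then combines them through their labels. The factors f1 and f2 are stand-ins for the Student columns of dataset1 and dataset2.

```r
f1 <- factor(c("Chris", "Sarah"))
f2 <- factor(c("Matt", "Keith"))

# The underlying codes index into each factor's own (sorted) levels,
# so the same code means different names in f1 and f2
as.integer(f1)   # codes for Chris, Sarah
as.integer(f2)   # codes for Matt, Keith (levels are Keith, Matt)

# Combining via the labels keeps the names intact
combined <- factor(c(as.character(f1), as.character(f2)))
combined
```

Because each factor has its own level set, the codes are only meaningful together with the levels, which is why going through as.character() is the safe route.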
There is the interaction() function in base R.
There is also the strata() function in the survival package.
The data.table package, which extends the functionality of data frames in some very useful ways, will combine factors automatically when you use the rbindlist function. Plus, if your two data sets are large, it will usually combine them more quickly.
library(data.table)
# Example data:
# (If you already have data frames, you can convert them using `as.data.table(dataframename)`)
dataset1<-data.table(Number=1:2,Student=as.factor(c("Chris","Sarah")))
dataset2<-data.table(Number=1:2,Student=as.factor(c("Matt","Keith")))
# Combine the two data sets:
# (It's not necessary to convert factors to characters)
rbindlist(list(dataset1,dataset2))
# Number Student
#1: 1 Chris
#2: 2 Sarah
#3: 1 Matt
#4: 2 Keith
You can now do this easily with fct_c() from the forcats package.
dataset1 <- data.frame(Number = c(1,2), Student = factor(c('Chris','Sarah')))
dataset2 <- data.frame(Number = c(1,2), Student = factor(c('Matt','Keith')))
library(forcats)
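# current forcats takes the factors as separate arguments
# (older versions accepted a single list of factors instead)
fct_c(dataset1[ ,2], dataset2[ ,2])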
# [1] Chris Sarah Matt Keith
# Levels: Chris Sarah Keith Matt
If your factors are inside data frames, then you can combine them this way using rbind:
> df1 <- data.frame(x=factor(c('a','b')))
> df2 <- data.frame(x=factor(c('c','d')))
> rbind(df1,df2)
x
1 a
2 b
3 c
4 d
