Insert missing rows by factor level - r

I'm sure there's a simple solution to this problem, but I'm having trouble figuring it out. I have a data frame in the following format:
Number Category Type Count
1 X A 10
2 X B 14
3 Y B 3
4 Z A 14
"Type" is a factor with two levels, {A,B}, and each level gets at least one "Category" entry (for simplicity, the categories are denoted X, Y, Z here, but my actual dataset has too many to list). I would like the number of rows each Type has to match by Category:
Number Category Type Count
1 X A 10
2 X B 14
3 Y A <NA>
4 Y B 3
5 Z A 14
6 Z B <NA>
For instance, if Type A is listed in four rows of Category A, but Type B has no Category A listings, then four new rows of Category A, Type B should be created (with Count=NA). Similarly, if Type A gets four rows of Category A and Type B has two, then two new rows should be created.
I was able to find numerous answers on how to do this for missing dates in time series data using seq(), expand.grid(), and merge(), but I can't quite see how to do it in this case. I hope this is clear... Grateful for any help!
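# stringsAsFactors = TRUE keeps Category and Type as factors (no longer the default since R 4.0)
dat <- read.table(header = TRUE, stringsAsFactors = TRUE, text =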
"Number Category Type Count
1 X A 10
2 X B 14
3 Y B 3
4 Z A 14")

Use expand.grid to make a master list and then merge:
alllevs <- do.call(expand.grid, lapply(dat[c("Type","Category")], levels))
merge(dat, alllevs, all.y=TRUE)
# Category Type Number Count
#1 X A 1 10
#2 X B 2 14
#3 Y A NA NA
#4 Y B 3 3
#5 Z A 4 14
#6 Z B NA NA
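If the columns are character vectors rather than factors, levels() returns NULL and the master list above comes out empty. A minimal sketch of the same idea that builds the combinations from unique() instead, assuming one row per Category/Type pair as in the example (the names all_pairs and filled are only illustrative):

```r
dat <- read.table(header = TRUE, stringsAsFactors = FALSE, text =
"Number Category Type Count
1 X A 10
2 X B 14
3 Y B 3
4 Z A 14")

# Build every Category/Type pair from the values actually present,
# so the columns do not need to be factors.
all_pairs <- expand.grid(Category = unique(dat$Category),
                         Type = unique(dat$Type),
                         stringsAsFactors = FALSE)

# Keep every pair; pairs absent from dat get NA in Number and Count.
filled <- merge(dat, all_pairs, by = c("Category", "Type"), all.y = TRUE)
filled
```

Naming the by columns explicitly avoids relying on merge() guessing the shared columns.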

Related

Conditionally update Dataframe from second dataframe in R

I have two data frames and would like to use the second to update the first. The problem is that the second data frame contains all the entries, but with different amounts of data (as shown below):
DF1         DF2             DF3
X Y         X Y             X Y
1 A         1 B             1 B
2 <NA>      2 B             2 B
3 <NA>      3 C      -->    3 C
4 D         4 <NA>          4 D
5 E         5 <NA>          5 E
It should be a simple update query where an entry in DF1 is updated if the corresponding entry in DF2 is not NA.
I first thought of removing the NA from the list
DF2sub <- subset(DF2, !is.na(Y))
DF3 <- transform(DF1, Y = DF2sub$Y[match(X,DF2sub$X)])
but the resulting code does the following
DF3
X Y
1 B
2 B
3 C
4 <NA>
5 <NA>
You can use which() directly to get the indices of the NA and non-NA values in DF2 and combine the corresponding rows, like this:
DF3 <- rbind(DF2[which(!is.na(DF2$Y)),],DF1[which(is.na(DF2$Y)),])
Hope this solves your issue.
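The rbind approach relies on DF1 and DF2 having their rows in the same order. A minimal sketch of an alternative that looks values up by key with match() and falls back to the DF1 value with ifelse(); the data frames are rebuilt here from the example in the question, and new_y is just an illustrative name:

```r
DF1 <- data.frame(X = 1:5, Y = c("A", NA, NA, "D", "E"), stringsAsFactors = FALSE)
DF2 <- data.frame(X = 1:5, Y = c("B", "B", "C", NA, NA), stringsAsFactors = FALSE)

# Look up each DF1 key in DF2, so the rows need not be in the same order.
new_y <- DF2$Y[match(DF1$X, DF2$X)]

# Take the DF2 value where it exists, otherwise keep the DF1 value.
DF3 <- transform(DF1, Y = ifelse(is.na(new_y), Y, new_y))
DF3
```

This assumes Y is a character column; with factor columns the values would need converting with as.character() first.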

data frame column names no longer unique when subsetting

I have a data frame that contains duplicate column names. I'm aware that it's non-standard to use duplicated column names, but these names are actually being reassigned downstream using user inputs. For now, I'm attempting to positionally subset a data frame, but the column names become deduplicated. Here's an example.
> df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = (2+(2:5)), check.names = F)
> df
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
4 4 5 E 7
However, when I attempt to subset, the names change...
> df[, 1:3]
x y y.1
1 1 2 B
2 2 3 C
3 3 4 D
4 4 5 E
Is there any way to prevent this from happening? It only occurs when I subset on columns, not rows.
> df[1:3,]
x y y y
1 1 2 B 4
2 2 3 C 5
3 3 4 D 6
Edit for others noticing this behavior:
I've done some digging into the behavior and found the relevant section in the help page for Extract.data.frame (type ?`[.data.frame`).
It states:
If [ returns a data frame it will have unique (and non-missing) row
names, if necessary transforming the row names using make.unique.
Similarly, if columns are selected column names will be transformed to
be unique if necessary (e.g., if columns are selected more than once,
or if more than one column of a given name is selected if the data
frame has duplicate column names).
This explains the why. I appreciate the comments so far on how best to work around it.
Here is an option, although I think it is not a good idea to have duplicated column names.
as.data.frame(as.list(df)[1:3], check.names = F)
# x y y
# 1 1 2 B
# 2 2 3 C
# 3 3 4 D
# 4 4 5 E
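Another option, sketched here under the assumption that the subset is positional as in the question, is to let `[` rename the columns and then put the original names back (keep and sub are illustrative names):

```r
df <- data.frame(x = 1:4, y = 2:5, y = LETTERS[2:5], y = 2 + (2:5),
                 check.names = FALSE)

keep <- 1:3

# `[` makes the selected column names unique (x, y, y.1) ...
sub <- df[, keep]

# ... so copy the original, duplicated names back by position.
names(sub) <- names(df)[keep]
names(sub)
```

Assigning through names<- does not trigger the deduplication, so the duplicated names survive until the next column subset.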

Data.table: Add rows for missing combinations of 2 factors without losing associated descriptive factors

I have a data table with multiple factors, for example:
library(data.table)
dt <- data.table(station=c(1,1,2,2,3), station.type=c("X","X","Y","Y","Y"), stage=c("A","B","A","B","A"), value=10:14)
station station.type stage value
1: 1 X A 10
2: 1 X B 11
3: 2 Y A 12
4: 2 Y B 13
5: 3 Y A 14
Each station is associated with a type (My actual data has over 50 stations and 10 types). In the example, the combination station 3 / stage B is missing. I want to add rows for the missing combinations, while retaining the type associated with the station.
I started from Matt Dowle's answer to this question:
Fastest way to add rows for missing values in a data.frame?
setkey(dt, station, stage)
dt[CJ(station, stage, unique=TRUE)]
station station.type stage value
1: 1 X A 10
2: 1 X B 11
3: 2 Y A 12
4: 2 Y B 13
5: 3 Y A 14
6: 3 NA B NA
But then I have to do another merge with the original data table to fill in the type for each station.
Is there a way to do it all in one line, something like:
dt[CJ(cbind(station, station.type), stage, unique=TRUE)]
(of course this doesn't work because CJ takes vectors as arguments)
Here's one way:
dt[, .SD[.(stage=c("A", "B")), on="stage"], by=.(station, station.type)]
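With more stages than the two hard-coded above, the same join can take its stage list from the data. A minimal sketch, assuming every stage of interest appears at least once in dt (stages and res are illustrative names):

```r
library(data.table)

dt <- data.table(station = c(1, 1, 2, 2, 3),
                 station.type = c("X", "X", "Y", "Y", "Y"),
                 stage = c("A", "B", "A", "B", "A"),
                 value = 10:14)

# All stages that occur anywhere in the table.
stages <- unique(dt$stage)

# Within each station/type group, join the group's rows to the full
# stage list; stages missing from the group get value = NA.
res <- dt[, .SD[.(stage = stages), on = "stage"],
          by = .(station, station.type)]
res
```

Because station.type is part of by, it is carried along for the added rows instead of being filled with NA.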

R sort summarise ddply by group sum

I have a data.frame like this
x <- data.frame(Category=factor(c("One", "One", "Four", "Two","Two",
"Three", "Two", "Four","Three")),
City=factor(c("D","A","B","B","A","D","A","C","C")),
Frequency=c(10,1,5,2,14,8,20,3,5))
Category City Frequency
1 One D 10
2 One A 1
3 Four B 5
4 Two B 2
5 Two A 14
6 Three D 8
7 Two A 20
8 Four C 3
9 Three C 5
I want to make a pivot table with sum(Frequency) and used the ddply function like this:
ddply(x,.(Category,City),summarize,Total=sum(Frequency))
Category City Total
1 Four B 5
2 Four C 3
3 One A 1
4 One D 10
5 Three C 5
6 Three D 8
7 Two A 34
8 Two B 2
But I need this results sorted by the total in each Category group. Something like this:
Category City Frequency
1 Two A 34
2 Two B 2
3 Three D 8
4 Three C 5
5 One D 10
6 One A 1
7 Four B 5
8 Four C 3
I have looked and tried sort, order, arrange, but nothing seems to do what I need. How can I do this in R?
Here is a base R version, where DF is the result of your ddply call:
with(DF, DF[order(-ave(Total, Category, FUN=sum), Category, -Total), ])
produces:
Category City Total
7 Two A 34
8 Two B 2
6 Three D 8
5 Three C 5
4 One D 10
3 One A 1
1 Four B 5
2 Four C 3
The logic is basically the same as David's: calculate the sum of Total for each Category, use that number for all rows in each Category (we do this with ave(..., FUN=sum)), and then sort by that plus some tie breakers to make sure the rows come out in the expected order.
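To make the role of ave() concrete, here is a small sketch that rebuilds DF by hand from the ddply output shown in the question and exposes the intermediate per-category sum (group_sum and sorted are illustrative names):

```r
DF <- data.frame(
  Category = c("Four", "Four", "One", "One", "Three", "Three", "Two", "Two"),
  City     = c("B", "C", "A", "D", "C", "D", "A", "B"),
  Total    = c(5, 3, 1, 10, 5, 8, 34, 2),
  stringsAsFactors = FALSE
)

# ave() returns one value per row: the sum of Total over that row's Category.
group_sum <- ave(DF$Total, DF$Category, FUN = sum)
group_sum

# Largest category first, then Category as a tie breaker, then largest Total.
sorted <- DF[order(-group_sum, DF$Category, -DF$Total), ]
sorted
```

Since ave() keeps the original row count, its result can be passed straight to order() without any merge step.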
This is a nice question, and I can't think of a straightforward way of doing this other than creating a total size index and then sorting by it. Here's a possible data.table approach that uses the setorder function, which orders your data by reference:
library(data.table)
Res <- setDT(x)[, .(Total = sum(Frequency)), by = .(Category, City)]
setorder(Res[, size := sum(Total), by = Category], -size, -Total, Category)[]
# Category City Total size
# 1: Two A 34 36
# 2: Two B 2 36
# 3: Three D 8 13
# 4: Three C 5 13
# 5: One D 10 11
# 6: One A 1 11
# 7: Four B 5 8
# 8: Four C 3 8
Or, if you are deep in the Hadleyverse, we can reach a similar result using the newer dplyr package (as suggested by @akrun):
library(dplyr)
x %>%
  group_by(Category, City) %>%
  summarise(Total = sum(Frequency)) %>%
  mutate(size = sum(Total)) %>%
  ungroup() %>%
  arrange(-size, -Total, Category)

Merging data frames row-wise and column-wise in R

How can one merge two data frames, one column-wise and other one row-wise? For example, I have two data frames like this:
A: add1 add2 add3 add4
1 k NA NA NA
2 l k NA NA
3 j NA NA NA
4 j l NA NA
B: age size name
1 5 6 x
2 8 2 y
3 1 3 x
4 5 4 z
I want to merge the two data.frames by row.name. However, I want to merge the data.frame A column-wise, instead of row-wise. So, I'm looking for a data.frame like this for result:
C: id age size name add
1 5 6 x k
2 8 2 y l
2 8 2 y k
3 1 3 x j
4 5 4 z j
4 5 4 z l
For example, suppose you have information about people in table B, including name, size, etc. This information is unique, so you have one row per person in B. Then suppose that in table A you have up to 5 past addresses per person. The first column is the most recent address, the second is the second most recent address, and so on. If someone has fewer than 5 addresses (e.g. 3), you have NA in the 4th and 5th columns for that person.
What I want to achieve is one data frame (C) that includes all of this information together. So, for a person with two addresses, I'll need two rows in table C, repeating the unique values and only different in the column address.
I was thinking of repeating the rows of data frame A by the number of non-NA values while keeping the row.names the same as they were (like data frame D), and then merging the new data frame with B. But I'm not sure how to do this.
D: address
1 k
2 l
2 k
3 j
4 j
4 l
Thank you!
Change the first data.frame to long format, then it's easy. df1 is A and df2 is B. I also name the numbers id.
require(tidyr)
# wide to long (your example D)
df1tidy <- gather(df1,addname,addval,-id)
# don't need the original add* vars or NA's
df1tidy$addname <- NULL
df1tidy <- df1tidy[!is.na(df1tidy$addval), ]
# merge them into the second data.frame
merge(df2,df1tidy,by = 'id',all.x = T)
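For reference, a rough base R sketch of the same wide-to-long-then-merge idea, without tidyr. The data frames are rebuilt from the question with an explicit id column, and add_cols, long and C are illustrative names:

```r
df1 <- data.frame(id = 1:4,
                  add1 = c("k", "l", "j", "j"),
                  add2 = c(NA, "k", NA, "l"),
                  add3 = NA_character_,
                  add4 = NA_character_,
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = 1:4,
                  age = c(5, 8, 1, 5),
                  size = c(6, 2, 3, 4),
                  name = c("x", "y", "x", "z"),
                  stringsAsFactors = FALSE)

add_cols <- setdiff(names(df1), "id")

# unlist() stacks the add* columns one after another, so the ids are
# repeated once per address column to stay aligned with the values.
long <- data.frame(id = rep(df1$id, times = length(add_cols)),
                   add = unlist(df1[add_cols], use.names = FALSE),
                   stringsAsFactors = FALSE)

# Drop the padding NAs, then attach one copy of the person data per address.
long <- long[!is.na(long$add), ]
C <- merge(df2, long, by = "id")
C
```

People with no addresses at all would be dropped by this merge; adding all.x = TRUE, as in the answer above, keeps them with NA in add.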
