Append 0 to missing observations in a dataframe. - r

I have a dataset where I expect a fixed number of observations in a data frame:
A 20
B 10
C 5
However, upon running my analysis this is not always the case. Sometimes I find missing observations, and the resulting data frame looks like this:
A 10
C 5
In this case there are no observations for B. I would like to append a row with 0 observations for each missing group to the final data frame before plotting, so that the missing groups are still shown.
The final data frame should look like this:
A 10
B 0
C 5
How can I accomplish this in R?

If you define the ID column (with A, B, C) as a factor, which seems appropriate here, then plotting the data also shows the factor levels that are defined but absent from the data. Here's a small example (since R 4.0.0, data.frame() no longer converts strings to factors by default, so the factor is created explicitly):
df <- data.frame(ID = factor(LETTERS[1:3]), x = rnorm(3))
df
# ID x
#1 A 1.350458
#2 B 1.340855
#3 C 1.311329
subdf <- df[c(1,3),]
subdf
# ID x
#1 A 1.350458
#3 C 1.311329
with(subdf, plot(x ~ ID))
You'll find that "B" is also present in the plot although it's not in the subsetted data.
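If the zero rows themselves are needed in the data frame, as the question asks, one possible base R sketch is to merge the observed data onto the full set of expected groups and replace the resulting NA values with 0. The column names V1 and V2 follow the sample data used elsewhere on this page; the `expected` data frame is a hypothetical list of groups that you would have to supply yourself.

```r
# Observed data: group B is missing
obs <- data.frame(V1 = c("A", "C"), V2 = c(10L, 5L))

# Hypothetical full set of groups that should always be present
expected <- data.frame(V1 = c("A", "B", "C"))

# all.x = TRUE keeps every expected group; unmatched groups get NA in V2
filled <- merge(expected, obs, by = "V1", all.x = TRUE)

# Turn the NA for missing groups into an explicit 0
filled$V2[is.na(filled$V2)] <- 0L

filled
#   V1 V2
# 1  A 10
# 2  B  0
# 3  C  5
```

This keeps everything in base R; whether it is preferable to the factor approach depends on whether the zeros are needed only for plotting or also for later computation.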

Maybe you can do something with melt and dcast from "reshape2".
Here's what I had in mind:
library(reshape2)
out <- dcast(
melt( # Makes a data.frame from a list
mget(ls(pattern = "df\\d")), # Collects the relevant df in a list
id.vars = "V1"), # The variable to melt by
L1 ~ V1, value.var = "value", fill = 0) # Other options for dcast
out
# L1 A B C
# 1 df1 20 10 5
# 2 df2 10 0 5
From there, you could go back to a long data form.
melt(out, id.vars = "L1")
# L1 variable value
# 1 df1 A 20
# 2 df2 A 10
# 3 df1 B 10
# 4 df2 B 0
# 5 df1 C 5
# 6 df2 C 5
If separate data.frames are required, then you can also look at using split, but if you are just going to be plotting, this format should work just fine.
Sample data
df1 <- structure(list(V1 = c("A", "B", "C"), V2 = c(20L, 10L, 5L)),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -3L))
df2 <- structure(list(V1 = c("A", "C"), V2 = c(10L, 5L)),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -2L))

Related

Add values to dataframe based on values in other column

I would like to add values to a column based on non-unique values in another column. For example, say I have a dataframe with a currently empty column that looks like this:
Site
Species Richness
A
0
A
0
A
0
B
0
B
0
I want to assign known species richness values for each site. Let's say site A has species richness 3, and site B has species richness 5. I would like the output to be:
Site
Species Richness
A
3
A
3
A
3
B
5
B
5
How do I input species richness values for specific sites?
I've tried this:
rows_update(df, tibble(Site = A, richness = 3))
rows_update(df, tibble(Site = B, richness = 5))
But I get an error message saying "'x' key values are not unique"
Any help would be appreciated!
Here, we could use a join with 'on' from data.table and assign (:=) the corresponding 'SpeciesRichness' column. This is likely to be efficient on large data because the column is updated in place:
library(data.table)
setDT(df)[data.table(Site = c('A','B'), SpeciesRichness = c(3, 5)),
SpeciesRichness := i.SpeciesRichness, on = .(Site)]
The issue with rows_update (see ?rows_update) is that the 'by' column must uniquely identify the rows in both datasets. From the documentation:
"The two tables are matched by a set of key variables whose values must uniquely identify each row."
In 'df', the key is repeated 3 times for 'A' and 2 times for 'B', so it is not unique. Using dplyr, we can do a left_join instead:
library(dplyr)
df %>%
left_join(tibble(Site = c('A', 'B'), new = c(3, 5)), by = "Site") %>%
transmute(Site, SpeciesRichness = new)
Output:
# Site SpeciesRichness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
data
df <- structure(list(Site = c("A", "A", "A", "B", "B"),
SpeciesRichness = c(0L,
0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L
))
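The uniqueness requirement quoted above can be checked before calling rows_update. The following is only a small base R sketch: `lookup` is a hypothetical name for the table of known richness values, and anyDuplicated() returns 0 when a vector has no repeated values.

```r
df <- data.frame(Site = c("A", "A", "A", "B", "B"),
                 SpeciesRichness = c(0L, 0L, 0L, 0L, 0L))

# Hypothetical lookup table with one row per site
lookup <- data.frame(Site = c("A", "B"), SpeciesRichness = c(3, 5))

# TRUE means the key column contains repeated values
dup_in_df     <- anyDuplicated(df$Site) > 0
dup_in_lookup <- anyDuplicated(lookup$Site) > 0

dup_in_df      # TRUE: 'Site' does not uniquely identify rows of df
dup_in_lookup  # FALSE: 'Site' is a valid key for the lookup table
```

When the target table has repeated keys, as here, a join or a match-based lookup is the more natural tool.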
You can create a dataframe with Site and Richness value and join them together.
In base R :
df1 <- data.frame(Site = rep(c('A', 'B'), c(3, 2)))
df2 <- data.frame(Site = c('A', 'B'), richness = c(3, 5))
df1 <- merge(df1, df2)
df1
# Site richness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
You can also use match :
df1$richness <- df2$richness[match(df1$Site, df2$Site)]
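A closely related base R idiom is to index a named vector by site. This is a sketch only; the vector name `richness_by_site` is made up for illustration. The as.character() call guards against 'Site' being a factor, in which case indexing would otherwise use the integer codes rather than the labels.

```r
df1 <- data.frame(Site = rep(c("A", "B"), c(3, 2)))

# Hypothetical named vector: names are sites, values are richness
richness_by_site <- c(A = 3, B = 5)

# Look up each row's site by name; unname() drops the carried-over names
df1$richness <- unname(richness_by_site[as.character(df1$Site)])

df1
#   Site richness
# 1    A        3
# 2    A        3
# 3    A        3
# 4    B        5
# 5    B        5
```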
You could define the values then use case_when
x <- 3
y <- 5
df %>%
mutate(SpeciesRichness= case_when(Site=="A" ~ x,
Site=="B" ~ y))
Output:
Site SpeciesRichness
1 A 3
2 A 3
3 A 3
4 B 5
5 B 5

How to use a dataset to extract specific columns from another dataset?

Use intersect to find common names between two data sets.
snp.common <- intersect(data1$snp, colnames(data2))
data2.separated <- data2[,snp.common]
It's always better to supply a minimal reproducible example:
df1 <- data.frame(V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
Now we can use a character vector to index the columns we want:
df1[, df2$snp]
Returns:
V2 V3
1 4 7
2 5 8
3 6 9
Edit:
Would you know how to do this so that it retains the "i..POP" column in data2?
df1 <- data.frame(ID = letters[1:3],
V1 = 1:3,
V2 = 4:6,
V3 = 7:9)
names(df1)[1] <- "ï..POP"
df2 <- data.frame(snp = c("V2", "V3"),
stringsAsFactors=FALSE)
We can use c to combine the names of the columns:
df1[, c("ï..POP", df2$snp)]
ï..POP V2 V3
1 a 4 7
2 b 5 8
3 c 6 9
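The two ideas above (intersect for safety, c() to keep an extra column) can be combined. The sketch below assumes a plain 'ID' column instead of "ï..POP", and adds a hypothetical name "V9" that does not exist in the data, to show what happens with unmatched names.

```r
df1 <- data.frame(ID = letters[1:3],
                  V1 = 1:3,
                  V2 = 4:6,
                  V3 = 7:9)

# Requested columns; "V9" is a hypothetical name that is not in df1
wanted <- c("V2", "V3", "V9")

keep       <- intersect(wanted, names(df1))  # names present in df1
not_found  <- setdiff(wanted, names(df1))    # names that would cause an error

# drop = FALSE keeps the result a data frame even if only one column remains
result <- df1[, c("ID", keep), drop = FALSE]

result
#   ID V2 V3
# 1  a  4  7
# 2  b  5  8
# 3  c  6  9
not_found
# [1] "V9"
```

Indexing with a name that is absent (df1[, "V9"]) raises an "undefined columns selected" error, so filtering with intersect first avoids that.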

Select minimum data of grouped data - keeping all columns [duplicate]

I am running into a wall here.
I have a dataframe, many rows.
Here is schematic example.
#myDf
ID c1 c2 myDate
A 1 1 01.01.2015
A 2 2 02.02.2014
A 3 3 03.01.2014
B 4 4 09.09.2009
B 5 5 10.10.2010
C 6 6 06.06.2011
....
I need to group my dataframe by my ID, then select the row with the oldest date in each group, and write the output into a new dataframe, keeping all columns.
ID c1 c2 myDate
A 3 3 03.01.2014
B 4 4 09.09.2009
C 6 6 06.06.2011
....
That is how I approach it:
test <- myDf %>%
group_by(ID) %>%
mutate(date == as.Date(myDate, format = "%d.%m.%Y")) %>%
filter(date == min(b2))
To verify: the number of rows of my resulting dataframe should equal the number of unique IDs.
unique(myDf$ID) %>% length == nrow(test)
FALSE
Does not work. I tried this:
newDf <- ddply(.data = myDf,
.variables = "ID",
.fun = function(piece){
take.this.row <- piece$myDate %>% as.Date(format="%d.%m.%Y") %>% which.min
piece[take.this.row,]
})
That does run forever. I terminated it.
Why is the first approach not working and what would be a good way to approach the problem?
Since you seem to have a fairly large dataset, data.table may be a better fit. Here is a data.table version of the solution, which is often faster than dplyr on large grouped data:
library(data.table)
df <- data.table(ID=c("A","A","A","B","B","C"),c1=1:6,c2=1:6,
myDate=c("01.01.2015","02.02.2014",
"03.01.2014","09.09.2009","10.10.2010","06.06.2011"))
df[,myDate:=as.Date(myDate, '%d.%m.%Y')]
df_new <- df[ df[, .I[myDate == min(myDate)], by=ID]$V1 ]
df_new
ID c1 c2 myDate
1: A 3 3 2014-01-03
2: B 4 4 2009-09-09
3: C 6 6 2011-06-06
PS: you can use setDT(mydf) to transform data.frame to data.table.
After grouping by 'ID', we can use which.min to get the index of 'myDate' (after converting to Date class), and we extract the rows with slice.
library(dplyr)
df1 %>%
group_by(ID) %>%
slice(which.min(as.Date(myDate, '%d.%m.%Y')))
# ID c1 c2 myDate
# (chr) (int) (int) (chr)
#1 A 3 3 03.01.2014
#2 B 4 4 09.09.2009
#3 C 6 6 06.06.2011
data
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")), .Names = c("ID",
"c1", "c2", "myDate"), class = "data.frame", row.names = c(NA,
-6L))
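As for why the first approach in the question fails, it appears to have two problems: mutate() uses == (a comparison) where = (an assignment) is needed, so no 'date' column is created, and filter() refers to 'b2', which is not a column in the example data. A minimal repaired sketch, assuming the example data from the question, could look like this:

```r
library(dplyr)

myDf <- data.frame(ID = c("A", "A", "A", "B", "B", "C"),
                   c1 = 1:6, c2 = 1:6,
                   myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                              "09.09.2009", "10.10.2010", "06.06.2011"),
                   stringsAsFactors = FALSE)

test <- myDf %>%
  group_by(ID) %>%
  # `=` creates the column; `==` would only compare
  mutate(date = as.Date(myDate, format = "%d.%m.%Y")) %>%
  # keep the row(s) whose date equals the group minimum
  filter(date == min(date)) %>%
  ungroup()

# One row per ID, as long as there are no tied minimum dates
nrow(test) == length(unique(myDf$ID))
```

Unlike slice(which.min(...)), filtering on the minimum keeps every tied row, so the row-count check can fail if two rows in a group share the oldest date.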
If you wanted to just use the base functions you can also go with the aggregate and merge functions.
# data (from response above)
df1 <- structure(list(ID = c("A", "A", "A", "B", "B", "C"), c1 = 1:6,
c2 = 1:6, myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
"09.09.2009", "10.10.2010", "06.06.2011")),
.Names = c("ID","c1", "c2", "myDate"),
class = "data.frame", row.names = c(NA,-6L))
# convert your date column to POSIXct object
df1$myDate = as.POSIXct(df1$myDate,format="%d.%m.%Y")
# Use the aggregate function to look for the minimum dates by group.
# In this case our variable of interest is the myDate column and the
# group to sort by is the "ID" column.
# The function will sort out the minimum date and create a new data frame
# with names "myDate" and "ID"
df2 = aggregate(list(myDate = df1$myDate),list(ID = df1$ID),
function(x){x[which(x == min(x))]})
df2
# Use the merge function to merge your original data frame with the
# data from the aggregate function
merge(df1,df2)
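Another base R option, which is a different technique from the aggregate and merge approach above, is to sort by group and date and then keep the first row of each group with duplicated(). This is a sketch under the assumption that one row per ID is wanted even when dates are tied.

```r
df1 <- data.frame(ID = c("A", "A", "A", "B", "B", "C"),
                  c1 = 1:6, c2 = 1:6,
                  myDate = c("01.01.2015", "02.02.2014", "03.01.2014",
                             "09.09.2009", "10.10.2010", "06.06.2011"),
                  stringsAsFactors = FALSE)

# Parse the dates without modifying the original column
d <- as.Date(df1$myDate, format = "%d.%m.%Y")

# Sort by ID, then by date (oldest first within each ID)
sorted <- df1[order(df1$ID, d), ]

# duplicated() is FALSE for the first occurrence of each ID,
# which after sorting is the row with the oldest date
oldest <- sorted[!duplicated(sorted$ID), ]

oldest
#   ID c1 c2     myDate
# 3  A  3  3 03.01.2014
# 4  B  4  4 09.09.2009
# 6  C  6  6 06.06.2011
```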

Merging different columns with the same name into single columns

I have a data.frame in which some columns have the same name. Now I want to merge (add up) these columns into single columns. So for example I want to turn...
v1 v1 v1 v2 v2
1 0 2 4 1
3 1 1 1 0
...into...
v1 v2
3 5
5 1
I only found threads dealing with two data.frames supposed to be merged into one but none dealing with this (rather simple?) problem.
The data can be recreated with this:
df <- structure(list(v1 = c(1L, 3L), v1 = 0:1, v1 = c(2L, 1L),
v2 = c(4L, 1L), v2 = c(1L, 0L)),
.Names = c("v1", "v1", "v1", "v2", "v2"),
class = "data.frame", row.names = c(NA, -2L))
as.data.frame(lapply(split.default(df, names(df)), function(x) Reduce(`+`, x)))
produces:
v1 v2
1 3 5
2 5 1
split.default(...) breaks up the data frame into groups with equal column names, then we use Reduce on each of those groups to sum the values of each column in the group iteratively until there is only one column left per group (see ?Reduce, that is what the function does), and finally we convert back to data frame with as.data.frame.
We have to use split.default because split (or really split.data.frame, which split dispatches to for a data frame) splits on rows, not columns.
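The same split.default grouping also works with rowSums in place of Reduce, which may read more directly as "sum across the columns of each group". This is a variation on the answer above, not a replacement for it; the data are rebuilt here so the sketch is self-contained.

```r
# Rebuild the example data with duplicated column names
df <- data.frame(c(1L, 3L), 0:1, c(2L, 1L), c(4L, 1L), c(1L, 0L))
names(df) <- c("v1", "v1", "v1", "v2", "v2")

# One small data frame per distinct column name
groups <- split.default(df, names(df))

# Sum across the columns within each group, row by row
res <- as.data.frame(lapply(groups, rowSums))

res
#   v1 v2
# 1  3  5
# 2  5  1
```

Note that rowSums returns doubles, so the result columns are numeric rather than integer.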
You can do this quite easily with melt and dcast from "reshape2". Since there's no "id" variable, I've used melt(as.matrix(df)) instead of melt(df, id.vars="id"). This automatically creates a long version of your data that has "Var1" as representing your rownames and "Var2" as your colnames. Using that knowledge, you can do:
library(reshape2)
dcast(melt(as.matrix(df)), Var1 ~ Var2,
value.var = "value", fun.aggregate=sum)
# Var1 v1 v2
# 1 1 3 5
# 2 2 5 1

R select the second element in a group

I am trying to find a more R-esque way of selecting the 2nd element (but NOT the first) of each group in R.
I ended up: 1. creating an index rowNumIndex; 2. putting the first row of each group in one data frame and the first two rows of each group in a separate data frame; and then 3. "reverse merging" the 2 data frames to get just the unique values from the data frame with the first two rows:
firsts <- ddply(df,.(group), function(x) head(x,1)) # 2 records using data below
seconds <- ddply(df,.(group), function(x) head(x,2)) # 4 records using data below
real.seconds <- seconds[!seconds$rowNumIndex %in% firsts$rowNumIndex, ] # 2 records, the second elements only
Here's some pretend data:
group var1 rowNumIndex
A 8 1
A 9 2
A 10 3
B 11 4
B 12 5
B 13 6
B 14 7
structure(list(group = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L
), .Label = c("A", "B"), class = "factor"), var1 = 8:14, rowNumIndex = 1:7), .Names = c("group",
"var1", "rowNumIndex"), class = "data.frame", row.names = c(NA,
-7L))
So, data frame firsts looks like:
group var1 rowNumIndex
A 8 1
B 11 4
And data frame seconds looks like:
group var1 rowNumIndex
A 8 1
A 9 2
B 11 4
B 12 5
And data frame real.seconds looks like:
group var1 rowNumIndex
A 9 2
B 12 5
Is there a way to do this w/o resorting to, e.g., the index? Thanks in advance for what will undoubtedly be a soul-crushingly simple and elegant solution!
A solution with dplyr:
library(dplyr)
group_by(df, group) %>% slice(2)
# group var1 rowNumIndex
# <fctr> <int> <int>
# 1 A 9 2
# 2 B 12 5
Pre-dplyr 0.3 alternative (the %.% operator has since been removed from dplyr in favor of %>%):
group_by(df, group)%.%filter(seq_along(var1)==2)
group var1 rowNumIndex
1 A 9 2
2 B 12 5
This solution will keep all the columns of the data. If you just want the two columns (group and var1), you can do this:
group_by(df, group)%.%summarise(var1[2])
group var1[2]
1 A 9
2 B 12
A solution with split, lapply and do.call
real.seconds<-do.call("rbind", lapply(split(df, df$group), function(x) x[2,]))
This will give you:
real.seconds
group var1 rowNumIndex
A A 9 2
B B 12 5
Or, more elegantly, with by:
real.seconds <- do.call(rbind, by(df, df$group, function(x) x[2, ]))
I would use data.table:
library(data.table)
dt = data.table(df)
dt[,var1[2],by=group]
As I think about it, there's no reason you shouldn't be able to do this with plyr:
ddply(df, .(group), function(x) x[2,])
A base alternative, where only 'var1' is aggregated:
aggregate(var1 ~ group, data = df, `[`, 2)
...or if you wish to aggregate all columns in the data frame, you can use the "dot notation":
aggregate(. ~ group, data = df, `[`, 2)
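Several of the approaches above index with x[2, ], which yields a row of NA values for any group that has fewer than two rows. If such groups should simply be dropped, one possible base R sketch is to compute each row's position within its group using ave() and filter on it. The group "C" with a single row is a hypothetical addition to the example data, included only to show that behaviour.

```r
df <- data.frame(group = c("A", "A", "A", "B", "B", "B", "B", "C"),
                 var1  = 8:15,
                 stringsAsFactors = FALSE)

# Position of each row within its group: 1, 2, 3, 1, 2, 3, 4, 1
pos <- ave(seq_len(nrow(df)), df$group, FUN = seq_along)

# Keep only rows that are second in their group; group C has none
second <- df[pos == 2, ]

second
#   group var1
# 2     A    9
# 5     B   12
```

No helper index column needs to be stored in the data frame, which addresses the original wish to avoid rowNumIndex.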
