Add values to dataframe based on values in other column - r

I would like to add values to a column based on non-unique values in another column. For example, say I have a dataframe with a currently empty column that looks like this:
Site
Species Richness
A
0
A
0
A
0
B
0
B
0
I want to assign known species richness values for each site. Let's say site A has species richness 3, and site B has species richness 5. I would like the output to be:
Site
Species Richness
A
3
A
3
A
3
B
5
B
5
How do I input species richness values for specific sites?
I've tried this:
rows_update(df, tibble(Site = A, richness = 3))
rows_update(df, tibble(Site = B, richness = 5))
But I get an error message saying "'x' key values are not unique"
Any help would be appreciated!

Here, we could make use of join on from data.table and assign := the corresponding column of 'SpeciesRichness'. It would be more efficient
library(data.table)
setDT(df)[data.table(Site = c('A','B'), SpeciesRichness = c(3, 5)),
SpeciesRichness := i.SpeciesRichness, on = .(Site)]
The issue with ?rows_update is that the by column should be uniquely identifying in both data.
The two tables are matched by a set of key variables whose values must uniquely identify each row.
With 'df', the values are replicated 3 times for 'A' and 2 for 'B'. Using dplyr, we can do a left_join
library(dplyr)
df %>%
left_join(tibble(Site = c('A', "B"), new = c(3, 5))) %>%
transmute(Site, SpeciesRichness = new)
-output
# Site SpeciesRichness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
data
df <- structure(list(Site = c("A", "A", "A", "B", "B"),
SpeciesRichness = c(0L,
0L, 0L, 0L, 0L)), class = "data.frame", row.names = c(NA, -5L
))

You can create a dataframe with Site and Richness value and join them together.
In base R :
df1 <- data.frame(Site = rep(c('A', 'B'), c(3, 2)))
df2 <- data.frame(Site = c('A', 'B'), richness = c(3, 5))
df1 <- merge(df1, df2)
df1
# Site richness
#1 A 3
#2 A 3
#3 A 3
#4 B 5
#5 B 5
You can also use match :
df1$richness <- df2$richness[match(df1$Site, df2$Site)]

You could define the values then use case_when
x <- 3
y <- 5
df %>%
mutate(SpeciesRichness= case_when(Site=="A" ~ x,
Site=="B" ~ y))
Output:
Site SpeciesRichness
1 A 3
2 A 3
3 A 3
4 B 5
5 B 5

Related

How can I extract a subset of data based on another data frame and grab observations before and after that subset

I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub where the resulting data frame is going to be df_sub plus the observations that occur before and after.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
var2 = c(2, 45))
They look like
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
2 x 1
3 x 2
4 y 45
5 z 56
I was thinking of using an inner_join or something similar
We could use match to get the index, then add or subtract 1 on those index, take the unique and subset the rows
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)) )
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))),]
-output
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
Or create a 'flag' column in the 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag'
library(dplyr)
df %>%
left_join(df_sub %>%
mutate(flag = TRUE)) %>%
filter(flag|lag(flag)|lead(flag)) %>%
select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number to keep track of the rows that are selected via join. Subset the data by including minimum row number - 1 and maximum row number + 1.
library(dplyr)
tmp <- df %>%
mutate(row = row_number()) %>%
inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56

tidyverse alternative to left_join & rows_update when two data frames differ in columns and rows

There might be a *_join version for this I'm missing here, but I have two data frames, where
The merging should happen in the first data frame, hence left_join
I not only want to add columns, but also update existing columns in the first data frame, more specifically: replace NA's in the first data frame by values in the second data frame
The second data frame contains more rows than the first one.
Condition #1 and #2 make left_join fail. Condition #3 makes rows_update fail. So I need to do some steps in between and am wondering if there's an easier solution to get the desired output.
x <- data.frame(id = c(1, 2, 3),
a = c("A", "B", NA))
id a
1 1 A
2 2 B
3 3 <NA>
y <- data.frame(id = c(1, 2, 3, 4),
a = c("A", "B", "C", "D"),
q = c("u", "v", "w", "x"))
id a q
1 1 A u
2 2 B v
3 3 C w
4 4 D x
and the desired output would be:
id a q
1 1 A u
2 2 B v
3 3 C w
I know I can achieve this with the following code, but it looks unnecessarily complicated to me. So is there maybe a more direct approach without having to do the intermediate pipes in the two commands below?
library(tidyverse)
x %>%
left_join(., y %>% select(id, q), by = c("id")) %>%
rows_update(., y %>% filter(id %in% x$id), by = "id")
You can left_join and use coalesce to replace missing values.
library(dplyr)
x %>%
left_join(y, by = 'id') %>%
transmute(id, a = coalesce(a.x, a.y), q)
# id a q
#1 1 A u
#2 2 B v
#3 3 C w

How to Transform R Dataframe factor into Indicator Varible using index

I want to Transform R Dataframe factor into Indicator Variable using some index in R.
Given following representation
StudentID Subject
1 A
1 B
2 A
2 C
3 A
3 B
I need following representation using StudentID as index
StudentID SubjectA SubjectB SubjectC
1 1 1 0
2 1 0 1
3 1 1 0
We can use table
table(df1)
# Subject
#StudentID A B C
# 1 1 1 0
# 2 1 0 1
# 3 1 1 0
If we need a data.frame
as.data.frame.matrix(table(df1))
Here's how I got it, using dcast from reshape2 as suggested in the comment above
library(reshape2)
ID <- c(1, 1, 2, 2, 3, 3)
Subject <- c('A', 'B', 'A', 'C', 'A', 'B')
data <- data.frame(ID, Subject)
data <- dcast(data, ID ~ Subject)
data[is.na(data)] <- 0
f <- function(x) {
x <- gsub('[A-Z]', 1, x)
}
as.data.frame(apply(data, 2, f))
# ID A B C
#1 1 1 1 0
#2 2 1 0 1
#3 3 1 1 0
Now that I look at this solution it may not be very efficient. But it is much more dynamic than some other solutions. There might also be a way to use data.table directly but I cannot figure it out. This might help though:
library(data.table)
df <- structure(list(StudentID = c(1, 1, 2, 2, 3, 3),
Subject = structure(c(1L,
2L, 1L, 3L, 1L, 2L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("StudentID",
"Subject"), row.names = c(NA, -6L), class = "data.frame")
df <- data.table(df)
### here we pull the unique student id's to use in group by
studentid <- as.character(unique(df$Subject))
### here we group by student ID's and paste which Subjects exist
x <- df[,list("Values"=paste(Subject,collapse="_")),by=StudentID]
### then we go through each one and try to match it to the unique vector
tmp <- strsplit(x$Values,"_")
res <- do.call(rbind,lapply(tmp,function(i) match(studentid,i)))
### change the results to the indicator variable desired
res[!is.na(res)] <- 1
res[is.na(res)] <- 0
res <- data.frame("StudentID"=x$StudentID,res)
colnames(res) <- c("StudentID",studentid)

cumsum when current obs equals next obs for same variable (column)

I want to add a column to a dataframe that makes a cumulated sum of another variable if yet another variable is equal for two rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to cumulate/sum the Var2 column, if Var1 obs for row 2 equals Var1 obs for row 1. With other words, if it is equal to the obs before.
If the cumsum is based on the Var1 as a grouping variable
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(CumVal=cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If it is based on whether adjacent elements are not equal
transform(df, CumVal= ave(Var2, cumsum(c(TRUE,Var1[-1]!=
Var1[-nrow(df)])), FUN=cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
group_by(indx= cumsum(c(TRUE,(lag(Var1)!=Var1)[-1]))) %>%
mutate(CumVal=cumsum(Var2)) %>%
ungroup() %>%
select(-indx)
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects similar successive values in a vector and describe it in a nice synthetic way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can that there are 2 lengths different from 1 corresponding to values 4 and 2)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumcum on a new vector y <- 1:10, but only for repeated values of x (which will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else to achieve the desired output):
cumval <- unlist(tapply(y, f, cumsum))

Append 0 to missing observations in a dataframe.

I have a dataset where I expect a fixed number of observations in a data-frame
A 20
B 10
C 5
However, upon running my analysis this is not always the case sometimes I find missing observations and the resulting dataframe looks like this
A 10
C 5
In this case there are no observations for B. I would want to append 0 observations to the final dataframe before ploting so as to indicate the values of the missing observation.
final data frame should look like this
A 10
B 0
C 5
How can I accomplish this in R?
If you define the ID column (with A,B,C) as factor which seems appropriate here, you could plot the data and even those factor levels which are not in the data (but in the defined factor levels) will be plotted. Here's a small example:
df <- data.frame(ID = LETTERS[1:3], x = rnorm(3))
df
# ID x
#1 A 1.350458
#2 B 1.340855
#3 C 1.311329
subdf <- df[c(1,3),]
subdf
# ID x
#1 A 1.350458
#3 C 1.311329
with(subdf, plot(x ~ ID))
You'll find that "B" is also present in the plot although it's not in the subsetted data.
Maybe you can do something with melt and dcast from "reshape2".
Here's what I had in mind:
library(reshape2)
out <- dcast(
melt( # Makes a data.frame from a list
mget(ls(pattern = "df\\d")), # Collects the relevant df in a list
id.vars = "V1"), # The variable to melt by
L1 ~ V1, value.var = "value", fill = 0) # Other options for dcast
out
# L1 A B C
# 1 df1 20 10 5
# 2 df2 10 0 5
From there, you could go back to a long data form.
melt(out, id.vars = "L1")
# L1 variable value
# 1 df1 A 20
# 2 df2 A 10
# 3 df1 B 10
# 4 df2 B 0
# 5 df1 C 5
# 6 df2 C 5
If separate data.frames are required, then you can also look at using split, but if you are just going to be plotting, this format should work just fine.
Sample data
df1 <- structure(list(V1 = c("A", "B", "C"), V2 = c(20L, 10L, 5L)),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -3L))
df2 <- structure(list(V1 = c("A", "C"), V2 = c(10L, 5L)),
.Names = c("V1", "V2"), class = "data.frame",
row.names = c(NA, -2L))

Resources