Add new variable to specific position in dataframe - r

I have a DF where I want to add a new variable called "B" into the 2nd position.
A C D
1 1 5 2
2 3 3 7
3 6 2 3
4 6 4 8
5 1 1 2
Anyone have an idea?

The easiest way would be to add the columns you want and then reorder them:
dat$B <- 1:5
newdat <- dat[, c("A", "B", "C", "D")]
Another way:
newdat <- cbind(dat[1], B=1:5, dat[,2:3])
If you're concerned about overhead, perhaps a data.table solution? (With help from this answer):
library(data.table)
dattable <- data.table(dat)
dattable[,B:=1:5]
setcolorder(dattable, c("A", "B", "C", "D"))

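dat$B <- 1:5
pos <- which(names(dat) == "A")
ind <- c(1:pos, ncol(dat), (pos + 1):(ncol(dat) - 1))
dat <- dat[, ind]
Create the variable at the end of the data.frame, then use an index vector that says how to reorder the columns. ind is just a vector of column numbers (here 1, 4, 2, 3). Note the parentheses around ncol(dat) - 1: without them, the subtraction is applied to the whole sequence because : binds more tightly than -.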
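The reordering idea in the answers above can be wrapped in a small helper. This is only a sketch in base R: the function name insert_column and its arguments are made up for illustration, and it assumes the new column name is not already used for something you want to keep.

```r
# Example data from the question
dat <- data.frame(A = c(1, 3, 6, 6, 1),
                  C = c(5, 3, 2, 4, 1),
                  D = c(2, 7, 3, 8, 2))

# Hypothetical helper: add `values` as column `name` at column index `position`
insert_column <- function(df, values, name, position) {
  stopifnot(position >= 1, position <= ncol(df) + 1)
  df[[name]] <- values                      # add (or overwrite) at the end
  old <- setdiff(names(df), name)           # all other columns, original order
  df[, append(old, name, after = position - 1), drop = FALSE]
}

newdat <- insert_column(dat, 1:5, "B", 2)
names(newdat)
```

append() with its after argument does the index bookkeeping, so no manual arithmetic on column positions is needed.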

Related

Count of number of elements between distinct elements in vector

Suppose I have a vector of values, such as:
A C A B A C C B B C C A A A B B B B C A
I would like to create a new vector that, for each element, contains the number of elements since that element was last seen. So, for the vector above,
NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
(where NA indicates that this is the first time the element has been seen).
For example, the first and second A are in positions 1 and 3 respectively, a difference of 2; the third and fourth A are in positions 5 and 12, a difference of 7, and so on.
Is there a pre-built pipe-compatible function that does this?
I hacked together this function to demonstrate:
# For reproducibility
set.seed(1)
# Example vector
x = sample(LETTERS[1:3], size = 20, replace = TRUE)
compute_lag_counts = function(x, first_time = NA){
  # return vector to fill
  lag_counts = rep(-1, length(x))
  # values to match
  vals = unique(x)
  # find all positions of all elements in the target vector
  match_list = grr::matches(vals, x, list = TRUE)
  # compute the lags, then put them in the appropriate place in the return vector
  for(i in seq_along(match_list))
    lag_counts[x == vals[i]] = c(first_time, diff(sort(match_list[[i]])))
  # return vector
  return(lag_counts)
}
compute_lag_counts(x)
Although it seems to do what it is supposed to do, I'd rather use someone else's efficient, well-tested solution! My searching has turned up empty, which is surprising to me given that it seems like a common task.
Or
ave(seq.int(x), x, FUN = function(x) c(NA, diff(x)))
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
We calculate the first difference of the indices for each group of x.
A data.table option thanks to @Henrik
library(data.table)
dt = data.table(x)
dt[ , d := .I - shift(.I), x]
dt
Here's a function that would work
compute_lag_counts <- function(x) {
  seqs <- split(seq_along(x), x)
  unsplit(Map(function(i) c(NA, diff(i)), seqs), x)
}
compute_lag_counts(x)
# [1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
Basically you use split() to separate the indexes where values appear by each unique value in your vector. Then we use the difference between the indexes where they appear to calculate the distance to the previous occurrence. Then we use unsplit() to put those values back in the original order.
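To make the three steps concrete, here is the same split / diff / unsplit idea written out with the intermediate results kept in variables. It is a sketch using the vector from the question; the names positions, gaps and result are only for illustration.

```r
x <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C",
       "C", "A", "A", "A", "B", "B", "B", "B", "C", "A")

# Step 1: positions of each distinct value
positions <- split(seq_along(x), x)
positions$A
# [1]  1  3  5 12 13 14 20

# Step 2: gap to the previous occurrence (NA for the first one)
gaps <- lapply(positions, function(i) c(NA, diff(i)))

# Step 3: put the gaps back in the original order
result <- unsplit(gaps, x)
result
# [1] NA NA  2 NA  2  4  1  4  1  3  1  7  1  1  6  1  1  1  8  6
```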
An option with dplyr by taking the difference of adjacent sequence elements after grouping by the original vector
library(dplyr)
tibble(v1) %>%
  mutate(ind = row_number()) %>%
  group_by(v1) %>%
  mutate(new = ind - lag(ind)) %>%
  pull(new)
#[1] NA NA 2 NA 2 4 1 4 1 3 1 7 1 1 6 1 1 1 8 6
data
v1 <- c("A", "C", "A", "B", "A", "C", "C", "B", "B", "C", "C", "A",
"A", "A", "B", "B", "B", "B", "C", "A")

R data frame subsetting based on a column value frequency threshold [duplicate]

I am a new R user and this is my first question submission (hopefully in compliance with the protocol).
I have a data frame with two columns.
library(dplyr)
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E" ))
dfc <- df %>% count(v1)
df$n <- with(dfc, n[match(df$v1,v1)])
v1 n
1 A 2
2 A 2
3 B 4
4 B 4
5 B 4
6 B 4
7 C 1
8 D 2
9 D 2
10 E 1
I want to keep at most 3 rows for each value in v1 and delete any rows beyond that threshold. Rows up to the threshold are retained. In this example I want to delete row 6 and retain all remaining rows in a subset data frame.
The result would include the following values for v1:
v1
1 A
2 A
3 B
4 B
5 B
6 C
7 D
8 D
9 E
Row 6 would have been deleted because it was the 4th occurrence of "B", but the 3 previous rows for "B" have been retained.
I have read multiple posts that demonstrate how to remove ALL rows for a variable with row totals less/greater than a cumulative frequency value, such as 4. For example, I have tried:
df1 <- df %>%
group_by(v1) %>%
filter(n() < 4)
This approach keeps only the rows whose v1 value occurs fewer than 4 times in total. 6 rows remain.
df2 <- df %>%
group_by(v1) %>%
filter(n() > 3)
This approach keeps only the rows whose v1 value occurs more than 3 times in total. 4 rows remain.
df4 <- subset(df, v1 %in% names(table(df$v1))[table(df$v1) <4])
This approach has the same result as the first approach.
None of these methods produce the result I need.
As previously stated, I need to retain the first three rows where v1="B" and only delete rows if there are > 3 occurrences of that value.
Because I am new to R, it's possible I am overlooking a very simple solution. Any suggestions would be greatly appreciated.
Thanks.
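Using dplyr, keep the first three rows of each group. Note that top_n(3) would not work here, because it keeps ties and therefore returns all four "B" rows:
df %>% group_by(v1) %>% filter(row_number() <= 3)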
This seems to do it:
index <- vector("numeric", nrow(df))
for (i in seq_len(nrow(df))) {
  # count how often this row's value has appeared in rows 1..i
  if (sum(df$v1[1:i] == df$v1[i]) <= 3) {
    index[i] <- i
  } else {
    cat(i)  # report the row that is dropped
  }
}
df[index, ]
v1 n
1 A 2
2 A 2
3 B 4
4 B 4
5 B 4
7 C 1
8 D 2
9 D 2
10 E 1
We can use data.table
library(data.table)
setDT(df)[, if(.N >3) head(.SD, 3) else .SD , v1]
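The answers above share one idea: number the occurrences within each value of v1 and keep only those up to the threshold. A base R sketch of that idea, without extra packages, could look like this (max_rows, occurrence and kept are names chosen here for illustration):

```r
df <- data.frame(v1 = c("A", "A", "B", "B", "B", "B", "C", "D", "D", "E"),
                 stringsAsFactors = FALSE)

max_rows <- 3  # keep at most this many rows per value of v1

# Running occurrence number within each value: 1, 2, 3, ... per group
occurrence <- ave(seq_along(df$v1), df$v1, FUN = seq_along)

kept <- df[occurrence <= max_rows, , drop = FALSE]
kept
```

Row 6 (the fourth "B") is the only row with an occurrence number above 3, so it is the only one dropped.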

Find start and end of ranges where data is upper case

I have a data.frame ystr:
v1
1 a
2 B
3 B
4 C
5 d
6 a
7 B
8 D
I want to find the start and end of each group of letters in CAPS so my output would be:
groupId startPos endPos
1 1 2 4
2 2 7 8
I was able to do it with a for loop by looking at each element in order and comparing it to the one before as follows:
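currentGroupId <- 0
startCounter <- 0  # 1 while we are inside a run of upper-case letters
mygroups <- data.frame(groupId = integer(0), startPos = integer(0), endPos = integer(0))
for (i in seq_len(nrow(ystr))) {
  if (grepl("[[:upper:]]", ystr[i, 1])) {
    if (startCounter == 0) {
      currentGroupId <- currentGroupId + 1
      startCounter <- 1
      mygroups[currentGroupId, ] <- c(currentGroupId, i, 0)
    }
  } else if (startCounter == 1) {
    startCounter <- 0
    mygroups[currentGroupId, 3] <- i - 1
  }
}
# close a run that extends to the last row
if (startCounter == 1) mygroups[currentGroupId, 3] <- nrow(ystr)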
Is there a simple way of doing this in R?
This might be similar to Mark start and end of groups but I could not figure out how it would apply in this case.
You can do this by calculating the run-length encoding (rle) of the binary indicator for whether your data is upper case, as determined by whether the data is equal to itself when it's converted to upper case.
with(rle(d[,1] == toupper(d[,1])),
data.frame(start=cumsum(lengths)[values]-lengths[values]+1,
end=cumsum(lengths)[values]))
# start end
# 1 2 4
# 2 7 8
You can see other examples of the use of rle by looking at Stack Overflow answers using this command.
Data:
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
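For readers new to rle, the one-liner above can be unpacked step by step. This is only a sketch on the same data; the intermediate names (is_upper, runs, ends, starts, groups) are chosen here for illustration.

```r
d <- data.frame(v1 = c("a", "B", "B", "C", "d", "a", "B", "D"),
                stringsAsFactors = FALSE)

# TRUE where the letter equals its upper-case version
is_upper <- d$v1 == toupper(d$v1)

# Run-length encoding: consecutive equal values collapsed into runs
runs <- rle(is_upper)
runs$lengths  # 1 3 2 2
runs$values   # FALSE TRUE FALSE TRUE

# End of each run is the cumulative length; start follows from the run length
ends   <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1

# Keep only the runs that are upper case
groups <- data.frame(groupId  = seq_len(sum(runs$values)),
                     startPos = starts[runs$values],
                     endPos   = ends[runs$values])
groups
```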
You can use the IRanges package. Its reduce() function merges adjacent positions into consecutive ranges.
d <- data.frame(v1=c("a", "B", "B", "C", "d", "a", "B", "D"))
d.idx <- which(d$v1 %in% LETTERS)
d.idx
# [1] 2 3 4 7 8
library(IRanges)
d.idx.ir <- IRanges(d.idx, d.idx)
reduce(d.idx.ir)
# IRanges of length 2
# start end width
# [1] 2 4 3
# [2] 7 8 2

Assigning numeric values to character elements of data frame

I have a 16 x 3 data frame. Its elements are characters, e.g. A, B, C, ... How can I assign them numeric values, e.g. A = 2, B = 5, C = 4, in R?
You can map the values from the vector you created:
relevel <- function(df, levelmap) {
  df[] <- lapply(df, function(x) levelmap[as.character(x)])
  df
}
The function subsets the values based on the map vector.
Example
df <- data.frame(x=c("A", "C", "C", "A"), y=c("B", "C", "B", "A"), z=c("A", "B", "C", "A"))
df
x y z
1 A B A
2 C C B
3 C B C
4 A A A
newlevels <- c(A=2,B=5,C=4)
relevel(df, newlevels)
x y z
1 2 5 2
2 4 4 5
3 4 5 4
4 2 2 2
The newlevels vector is a special vector called a named vector. It's very helpful as it can be referenced by both its names and its indices. newlevels["A"] and newlevels[1] both return the same output. This simplifies what in other languages would require hash tables or other lookup arrays.
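A short sketch of how a named vector behaves as a lookup table, using the newlevels vector from the answer. The keys vector and the "Z" key are made up here to show vectorised lookup and what happens for a name that is not in the map.

```r
newlevels <- c(A = 2, B = 5, C = 4)

# Lookup by name and by position give the same element
identical(newlevels["A"], newlevels[1])
# [1] TRUE

# Lookup is vectorised: one call maps a whole character vector
keys <- c("A", "C", "C", "B")
mapped <- unname(newlevels[keys])
mapped
# [1] 2 4 4 5

# A name that is not in the map yields NA rather than an error
missing_key <- unname(newlevels["Z"])
missing_key
# [1] NA
```

This is also why relevel() above calls as.character(x) first: indexing with a factor would use its integer codes rather than its labels.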

R - How to apply different functions to certain rows in a column

I am trying to apply different functions to different rows based on the value of a string in an adjacent column. My dataframe looks like this:
type size
A 1
B 3
A 4
C 2
C 5
A 4
B 32
C 3
and I want to apply different functions to types A, B, and C, to give a third column, "size2". For example, let's say the following functions apply to A, B, and C:
for A: size2 = 3*size
for B: size2 = size
for C: size2 = 2*size
I'm able to do this for each type separately using this code
df$size2 <- ifelse(df$type == "A", 3*df$size, NA)
df$size2 <- ifelse(df$type == "B", 1*df$size, NA)
df$size2 <- ifelse(df$type == "C", 2*df$size, NA)
However, I can't seem to do it for all of the types without erasing all of the other values. I tried to use this code to limit the application of the function to only those values that were NA (i.e., keep existing values and only fill in NA values), but it didn't work using this code:
df$size2 <- ifelse(is.na(df$size2), ifelse(df$type == "C", 2*df$size, NA), NA)
Does anyone have any ideas? Is it possible to use some kind of AND statement with "is.na(df$size2)" and "ifelse(df$type == "C""?
Many thanks!
This might be a bit more R-ish (and I called my dataframe 'dat' instead of 'df', since df is a commonly used function).
> facs <- c(3,1,2)
> dat$size2= dat$size* facs[ match( dat$type, c("A","B","C") ) ]
> dat
type size size2
1 A 1 3
2 B 3 3
3 A 4 12
4 C 2 4
5 C 5 10
6 A 4 12
7 B 32 32
8 C 3 6
The match function is used to construct indexes to supply to the extract function [.
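To see what match() contributes, it can help to look at the index vector on its own. A small sketch with the question's data held in plain vectors (type, size, idx and size2 are just illustrative names):

```r
type <- c("A", "B", "A", "C", "C", "A", "B", "C")
size <- c(1, 3, 4, 2, 5, 4, 32, 3)

# Position of each type in the lookup key c("A", "B", "C")
idx <- match(type, c("A", "B", "C"))
idx
# [1] 1 2 1 3 3 1 2 3

# Use those positions to pick the multiplier for every row
facs <- c(3, 1, 2)
size2 <- size * facs[idx]
size2
# [1]  3  3 12  4 10 12 32  6
```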
If you want, you can nest the ifelse() calls:
df$size2 <- ifelse(df$type == "A", 3*df$size,
ifelse(df$type == "B", 1*df$size,
ifelse(df$type == "C", 2*df$size, NA)))
# > df
# type size size2
#1 A 1 3
#2 B 3 3
#3 A 4 12
#4 C 2 4
#5 C 5 10
#6 A 4 12
#7 B 32 32
#8 C 3 6
You could do it like this, creating separate logical vectors for each type:
df$size2 <- NA_real_  # create the column first so the partial assignments have a target
As <- df$type == 'A'
Bs <- df$type == 'B'
Cs <- df$type == 'C'
df$size2[As] <- 3*df$size[As]
df$size2[Bs] <- df$size[Bs]
df$size2[Cs] <- 2*df$size[Cs]
but a more direct approach would be to create a separate lookup table like this:
df$size2 <- c(A=3,B=1,C=2)[as.character(df$type)] * df$size
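The lookup table works because all three rules are multiplications. If the per-type rules were arbitrary functions, the same idea could be sketched with a named list of functions instead. The list funs below is an assumption made for illustration, filled with the three rules from the question:

```r
df <- data.frame(type = c("A", "B", "A", "C", "C", "A", "B", "C"),
                 size = c(1, 3, 4, 2, 5, 4, 32, 3),
                 stringsAsFactors = FALSE)

# One function per type; any vectorised function of size would do here
funs <- list(A = function(s) 3 * s,
             B = function(s) s,
             C = function(s) 2 * s)

df$size2 <- NA_real_  # rows whose type has no function stay NA
for (t in names(funs)) {
  rows <- df$type == t
  df$size2[rows] <- funs[[t]](df$size[rows])
}
df
```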

Resources