Creating "duplicate" rows with one variable in R

Had a hard time naming this question, as I'm not sure what to call this maneuver. However, the idea is fairly simple. I have a dataframe in which some of the values are vectors.
letters <- c("a", "b", "c", "d", "e")
numbers <- list(34, 23, c(23, 34, 45), 23, c(45,56,43,12))
df <- data.frame(letters)
df$numbers <- numbers
df
letters numbers
1 a 34
2 b 23
3 c c(23, 34, 45)
4 d 23
5 e c(45, 56, 43, 12)
What I want to obtain is a data.frame that duplicates all rows that contain vectors in the column numbers by the number of objects in those vectors. They must be exact duplicates except for the numbers column, which should be variable. Like so:
letters numbers
1 a 34
2 b 23
3 c 23
4 c 34
5 c 45
6 d 23
7 e 45
8 e 56
9 e 43
10 e 12
Any easy solution to this?

We can use unnest from tidyr:
library(tidyr)
unnest(df, numbers)
# letters numbers
# (fctr) (dbl)
#1 a 34
#2 b 23
#3 c 23
#4 c 34
#5 c 45
#6 d 23
#7 e 45
#8 e 56
#9 e 43
#10 e 12

The tidyr option is the neatest, but if you want to use base R, you can do:
stack(setNames(numbers, letters))
# values ind
# 1 34 a
# 2 23 b
# 3 23 c
# 4 34 c
# 5 45 c
# 6 23 d
# 7 45 e
# 8 56 e
# 9 43 e
# 10 12 e
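Note that stack() returns only the two columns involved. If other columns have to be copied along with each duplicated row, a base R sketch is to repeat each row index once per element and then flatten the list column. The data frame below is rebuilt from the question so that the snippet runs on its own; the names idx and out are only illustrative.

```r
# Rebuild the example data frame with a list column
df <- data.frame(letters = c("a", "b", "c", "d", "e"))
df$numbers <- list(34, 23, c(23, 34, 45), 23, c(45, 56, 43, 12))

# Repeat each row index once per element of its `numbers` entry
idx <- rep(seq_len(nrow(df)), lengths(df$numbers))

# Copy the non-list column(s) for the repeated rows, then attach the flattened values
out <- df[idx, "letters", drop = FALSE]
out$numbers <- unlist(df$numbers)
rownames(out) <- NULL
out
```

With more columns, the "letters" selection would be replaced by every column name except the list column.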

Related

How to calculate the sum of specific columns in R and make the results in a another column

I'm a beginner in biostatistics and R, and I need help with an issue.
I have a table with more than 170 columns and more than 6000 rows. I want to add another column that contains the sum of all the columns except the first two.
For example, if I have data in 5 columns, A to E:
A B C D E
12 2 13 98 6
10 7 8 67 12
12 56 67 9 7
I want to add another column (column F, for example) that contains the sum of columns C, D and E (that is, all the columns except the first two).
So the result would be:
A B C D E F
AA 2 13 98 6 117
CF 7 8 67 12 87
QZ 56 67 9 7 83
Please tell me if you need any other information or clarification.
Thank you very much
Does this work:
library(dplyr)
df %>% rowwise() %>% mutate(F = sum(c_across(-c(A:B))))
# A tibble: 3 x 6
# Rowwise:
A B C D E F
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2 13 98 6 117
2 10 7 8 67 12 87
3 12 56 67 9 7 83
Data used:
df
# A tibble: 3 x 5
A B C D E
<dbl> <dbl> <dbl> <dbl> <dbl>
1 12 2 13 98 6
2 10 7 8 67 12
3 12 56 67 9 7
library(tibble)
library(dplyr)
tbl <-
tibble::tribble(
~A, ~B, ~C, ~D, ~E,
12, 2, 13, 98, 6,
10, 7, 8, 67, 12,
12, 56, 67, 9, 7
)
tbl %>% dplyr::mutate("F" = C + D + E )
## R might treat F as an abbreviation for FALSE, so I put it in quotes
You will find the information you need in the top answer to this question:
stackoverflow.com/questions/3991905/sum-rows-in-data-frame-or-matrix
Basically, you just name your new column, use the rowSums function, and specify the columns you want to include with the square bracket subsetting.
data$new <- rowSums( data[,43:167] )
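Applied to the small example from the question, this could look like the sketch below. The data frame is retyped from the question, and the negative index -(1:2) is one way of saying "every column except the first two", so no column count has to be hard-coded.

```r
# Example data retyped from the question
df <- data.frame(
  A = c(12, 10, 12),
  B = c(2, 7, 56),
  C = c(13, 8, 67),
  D = c(98, 67, 9),
  E = c(6, 12, 7)
)

# Sum every column except the first two, row by row
df$F <- rowSums(df[, -(1:2)])
df$F
# 117 87 83
```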

Remove duplicated columns [duplicate]

I have data spanning a couple decades that I'm reading in from yearly files and row binding. I've found that on occasion I end up with columns that have duplicate values, and I'd like to remove the duplicate columns. This has to happen over very large tables (millions of rows, hundreds of columns) so doing a pairwise check is likely not feasible.
Example data:
df <- data.frame(id = c(1:6), x = c(15, 21, 14, 21, 14, 38), y = c(36, 38, 55, 11, 5, 18), z = c(15, 21, 14, 21, 14, 38), a = c("D", "B", "A", "F", "H", "P"))
> df
id x y z a
1 1 15 36 15 D
2 2 21 38 21 B
3 3 14 55 14 A
4 4 21 11 21 F
5 5 14 5 14 H
6 6 38 18 38 P
z is a duplicate of x, so should be removed. Desired result:
> df2
id x y a
1 1 15 36 D
2 2 21 38 B
3 3 14 55 A
4 4 21 11 F
5 5 14 5 H
6 6 38 18 P
We can apply duplicated on the transposed dataset and use it to subset the columns
df[!duplicated(t(df))]
# id x y a
#1 1 15 36 D
#2 2 21 38 B
#3 3 14 55 A
#4 4 21 11 F
#5 5 14 5 H
#6 6 38 18 P
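Since t(df) coerces the whole data frame into a single matrix, which may be costly on very large tables, one possible alternative is to call duplicated() on the list of columns instead. This is only a sketch, and it assumes that duplicate columns also share the same type (an integer column and a double column with equal values may not be flagged).

```r
# Example data from the question
df <- data.frame(
  id = 1:6,
  x = c(15, 21, 14, 21, 14, 38),
  y = c(36, 38, 55, 11, 5, 18),
  z = c(15, 21, 14, 21, 14, 38),
  a = c("D", "B", "A", "F", "H", "P")
)

# Treat the data frame as a list of columns and keep the first copy of each
df2 <- df[!duplicated(as.list(df))]
names(df2)
# "id" "x" "y" "a"
```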

How can we apply a function to a column vector from every set of contiguously matching rows of a data frame

For example, using column 1 as the matching criterion, let's call replicate(length(v), sum(v)) on the column 2 vector, v, of every set of contiguous, matching rows in the data frame A (including sets of size 1).
A v
a 12
a 43
b 8
a 4
b 12
c 5
c 9
d 21
->
55, 55, 8, 4, 12, 14, 14, 21
The operation can return a vector or a list of vectors that we can coerce to a vector with unlist().
Here's a simple solution using data.table, simply because of its built-in rleid function and because it handles factors seamlessly
library(data.table)
setDT(df)[, res := sum(v), by = rleid(A)]
df
# A v res
# 1: a 12 55
# 2: a 43 55
# 3: b 8 8
# 4: a 4 4
# 5: b 12 12
# 6: c 5 14
# 7: c 9 14
# 8: d 21 21
If we want base R we could either recreate rleid or just combine cumsum with ave
with(df, ave(v, cumsum(c(TRUE, head(A, -1) != tail(A, -1))), FUN = sum))
# [1] 55 55 8 4 12 14 14 21
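For the "recreate rleid" route, a minimal base R sketch could build the run ids from rle(). The vectors A and v are retyped from the question, and run_id is just an illustrative name.

```r
# Example data retyped from the question
A <- c("a", "a", "b", "a", "b", "c", "c", "d")
v <- c(12, 43, 8, 4, 12, 5, 9, 21)

# rle() gives the length of each run of equal values;
# repeating the run number by its length yields one id per row
r <- rle(A)
run_id <- rep(seq_along(r$lengths), r$lengths)

# Sum v within each run and spread the sum back over the run's rows
res <- ave(v, run_id, FUN = sum)
res
# 55 55 8 4 12 14 14 21
```

If A is a factor, wrapping it in as.character() before calling rle() is the safer choice.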
Here is an option using dplyr
library(dplyr)
df1 %>%
group_by(A1 = cumsum(A!= dplyr::lag(A, default=A[1]))) %>%
mutate(res = sum(v)) %>%
ungroup() %>%
select(-A1)
# A v res
# (chr) (int) (int)
#1 a 12 55
#2 a 43 55
#3 b 8 8
#4 a 4 4
#5 b 12 12
#6 c 5 14
#7 c 9 14
#8 d 21 21

How to sum over diagonals of data frame

Say that I have this data frame:
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2
In this above data frame, the values indicate counts of how many observations take on (100, 1), (99, 1), etc.
In my context, the diagonals have the same meanings:
1 2 3 4
100 A B C D
99 B C D E
98 C D E F
97 D E F G
How would I sum across the diagonals (i.e., sum the counts of the like letters) in the first data frame?
This would produce:
group sum
A 8
B 13
C 13
D 28
E 10
F 18
G 2
For example, D is 5+5+4+14
You can use row() and col() to identify row/column relationships.
m <- read.table(text="
1 2 3 4
100 8 12 5 14
99 1 6 4 3
98 2 5 4 11
97 5 3 7 2")
vals <- sapply(2:8,
function(j) sum(m[row(m)+col(m)==j]))
or (as suggested in comments by @thelatemail)
vals <- sapply(split(as.matrix(m), row(m) + col(m)), sum)
data.frame(group=LETTERS[seq_along(vals)],sum=vals)
or (@Frank)
data.frame(vals = tapply(as.matrix(m),
(LETTERS[row(m) + col(m)-1]), sum))
as.matrix() is required to make split() work correctly ...
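To see why row() + col() identifies the diagonals, here is a self-contained sketch that starts from a plain matrix instead of read.table(), so no as.matrix() call is needed. The counts are retyped from the question, and diag_id and sums are illustrative names.

```r
# Counts from the question, stored as a matrix
m <- matrix(
  c(8, 12, 5, 14,
    1,  6, 4,  3,
    2,  5, 4, 11,
    5,  3, 7,  2),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("100", "99", "98", "97"), c("1", "2", "3", "4"))
)

# Cells on the same anti-diagonal share the same row index + column index
diag_id <- row(m) + col(m) - 1   # 1 for the top-left cell, 7 for the bottom-right

# Label each diagonal with a letter and sum the counts per label
sums <- tapply(m, LETTERS[diag_id], sum)
data.frame(group = names(sums), sum = as.vector(sums))
```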
Another aggregate variation, avoiding the formula interface, which actually complicates matters in this instance:
aggregate(list(Sum=unlist(dat)), list(Group=LETTERS[c(row(dat) + col(dat))-1]), FUN=sum)
# Group Sum
#1 A 8
#2 B 13
#3 C 13
#4 D 28
#5 E 10
#6 F 18
#7 G 2
Another solution using bgoldst's definition of df1 and df2
sapply(unique(c(as.matrix(df2))),
function(x) sum(df1[df2 == x]))
Gives
#A B C D E F G
#8 13 13 28 10 18 2
(Not quite the format that you wanted, but maybe it's ok...)
Here's a solution using stack(), and aggregate(), although it requires the second data.frame contain character vectors, as opposed to factors (could be forced with lapply(df2,as.character)):
df1 <- data.frame(a=c(8,1,2,5), b=c(12,6,5,3), c=c(5,4,4,7), d=c(14,3,11,2))
df2 <- data.frame(a=c('A','B','C','D'), b=c('B','C','D','E'), c=c('C','D','E','F'), d=c('D','E','F','G'), stringsAsFactors=FALSE)
aggregate(sum~group, data.frame(sum=stack(df1)[,1], group=stack(df2)[,1]), sum)
## group sum
## 1 A 8
## 2 B 13
## 3 C 13
## 4 D 28
## 5 E 10
## 6 F 18
## 7 G 2

Loop for Multiple Vlookup in R

I have a data frame that looks like this:
lhs1=c("A","D","C","B")
lhs2=c("B","A","C","I")
lhs3=c("I","B","A","D")
lhs4=c("A","C","B","D")
df <- data.frame(lhs1,lhs2,lhs3,lhs4)
lhs1 lhs2 lhs3 lhs4
1 A B I A
2 D A B C
3 C C A B
4 B I D D
And I want to add four more columns that show the sale for each letter, based on the values in this lookup data frame:
category <- c("A","B","C","D","E","I")
sale <- c(12,23,34,35,38,42)
look <- data.frame(category,sale)
category sale
A 12
B 23
C 34
D 35
E 38
I 42
So my data frame will look like this:
lhs1 lhs2 lhs3 lhs4 lhs1.sale lhs2.sale lhs3.sale lhs4.sale
A B I A 12 23 42 12
D A B C 35 12 23 34
C C A B 34 34 12 23
B I D D 23 42 35 35
Kindly help me create a loop that can perform multiple VLOOKUP-style lookups in R.
Try this
df[paste(names(df), "sale", sep = ".")] <- look$sale[match(unlist(df), look$category)]
df
# lhs1 lhs2 lhs3 lhs4 lhs1.sale lhs2.sale lhs3.sale lhs4.sale
# 1 A B I A 12 23 42 12
# 2 D A B C 35 12 23 34
# 3 C C A B 34 34 12 23
# 4 B I D D 23 42 35 35
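If an explicit loop is preferred, as the question asks, one possible sketch is to turn the lookup table into a named vector and index it once per column. The data is retyped from the question with stringsAsFactors = FALSE so that the columns are character; sale_by_category and lhs_cols are illustrative names.

```r
# Data retyped from the question, kept as character columns
df <- data.frame(
  lhs1 = c("A", "D", "C", "B"),
  lhs2 = c("B", "A", "C", "I"),
  lhs3 = c("I", "B", "A", "D"),
  lhs4 = c("A", "C", "B", "D"),
  stringsAsFactors = FALSE
)
look <- data.frame(
  category = c("A", "B", "C", "D", "E", "I"),
  sale = c(12, 23, 34, 35, 38, 42),
  stringsAsFactors = FALSE
)

# Named vector: names are the categories, values are the sales
sale_by_category <- setNames(look$sale, look$category)

# Capture the original column names first, since the loop adds new columns
lhs_cols <- names(df)
for (col in lhs_cols) {
  # as.character() guards against factor columns being used as integer codes
  df[[paste0(col, ".sale")]] <- unname(sale_by_category[as.character(df[[col]])])
}
df
```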
Here's a data.table solution.
library(data.table)
setkey(setDT(look),category) # convert look to data.table; index on category
cn <- paste0(names(df),".sales") # names for the new columns
setDT(df)[,c(cn):=lapply(.SD,function(col)look[col]$sale)]
df
# lhs1 lhs2 lhs3 lhs4 lhs1.sales lhs2.sales lhs3.sales lhs4.sales
# 1: A B I A 12 23 42 12
# 2: D A B C 35 12 23 34
# 3: C C A B 34 34 12 23
# 4: B I D D 23 42 35 35
