How to deal with duplicates using Excel/R?

I have data that looks like this in Excel:
Name Value
A 10
B 20
A 30
C 40
E 50
D 60
E 70
F 80
A 90
B 100
The data continues for hundreds of rows. I know how to delete duplicates, but is there a way in Excel or R to differentiate the duplicates instead?
My final table would ideally look like this:
Name Value
A 10
B 20
A_1 30
C 40
E 50
D 60
E_1 70
F 80
A_2 90
B_1 100

We can use make.unique in R (it expects a character vector, so convert a factor column with as.character first):
df$Name <- make.unique(df$Name, sep='_')
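A minimal sketch of this on the question's sample rows. The data frame is typed in by hand here, which is an assumption about how the sheet would look after import:

```r
# Sample data from the question, entered manually (assumed import result)
df <- data.frame(
  Name  = c("A", "B", "A", "C", "E", "D", "E", "F", "A", "B"),
  Value = seq(10, 100, by = 10),
  stringsAsFactors = FALSE
)

# The first occurrence keeps its name; later ones get _1, _2, ...
df$Name <- make.unique(df$Name, sep = "_")
print(df)
```

The resulting Name column matches the table the question asks for.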

In Excel, with the names in column A starting at A2, put this formula in a helper column and fill it down:
=A2& IF(COUNTIF($A$2:$A2,A2)=1,"","_"&COUNTIF($A$2:$A2,A2)-1)
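The formula is a running count: COUNTIF($A$2:$A2,A2) counts how often the current name has appeared up to and including the current row, and a suffix is appended only from the second occurrence on. For comparison, here is a sketch of the same running-count logic in base R; the vector name is a hypothetical stand-in for column A:

```r
# Hypothetical stand-in for column A of the sheet
name <- c("A", "B", "A", "C", "E", "D", "E", "F", "A", "B")

# Running count of each name so far (the COUNTIF($A$2:$A2, A2) part)
running <- ave(seq_along(name), name, FUN = seq_along)

# No suffix on the first occurrence, "_<count - 1>" afterwards
out <- ifelse(running == 1, name, paste0(name, "_", running - 1))
print(out)
```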

Related

Perform row-wise operation in datatable with multiple elements

I have the following data table:
library(data.table)
set.seed(1)
DT <- data.table(ind=1:100,x=sample(100),y=sample(100),group=c(rep("A",50),rep("B",50)))
Now the problem I have is that I need to take every value in column "x" (that is, each given ID) and add each of the existing values in column "y" to it, separately per "group". Let's assume we start with ID = 1. This element has the values x_1 = 68 and y_1 = 76. We also see y_2 = 39, y_3 = 24, etc. So what I want to compute is the sums x_1 + y_1, x_1 + y_2, x_1 + y_3, etc. But not only for x_1, but also for x_2, x_3, etc. So for x_2 it would look like: x_2 + y_1, x_2 + y_2, x_2 + y_3, etc. This should be done separately per "group" (in this regard the dataset should simply be split by group).
Edit: Example code to do this only for x_1 and group A:
current_X <- DT[1,x] # not needed, just to illustrate
vector_current_X <- rep(DT[1,x],nrow(DT[group == "A"]))
DT[group == "A",copy_current_X := vector_current_X]
DT[,sum_current_X_Y := copy_current_X + y]
DT
One apparent issue with this approach is that if it were applied to all x, then a lot of columns would be added to the final DT. So I am not sure if it is the best approach. In the end, I am just looking for the lowest sum (per element x) with each element y, and per group.
I know how to do operations per group, and I also know the lapply functions. The issue is that from my understanding, I need to include a row-wise loop. And next, the structure of the result will be different from the original data table, because we have many additional observations. I have seen before that you can save lists inside a data.table, but I am unsure if that is the best approach. My dataset is much larger, so efficiency is important.
Thanks for any hints on how to approach this.

You can do this:
DT[, .(.BY$x+DT[group==.BY$group,y]), by=.(x,group)]
This returns N rows per x, where N is the size of x's group. We leverage the special symbol .BY, which is available in j when by is used. Basically, .BY is a named list containing the values of the grouping variables for the current group. Here, I'm adding the value of x (.BY$x) to the vector of y values from the subset of DT where the group is equal to the current group value (.BY$group).
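To make the shape of this result concrete: within one group, every x is paired with every y, which is an outer sum. Below is a small base R sketch of the same computation on hypothetical toy data (four rows invented for illustration, not the question's DT):

```r
# Toy data, invented for illustration only
df <- data.frame(
  x     = c(1, 2, 10, 20),
  y     = c(5, 7, 100, 300),
  group = c("A", "A", "B", "B")
)

# One matrix per group: rows are x values, columns are y values
sums <- lapply(split(df, df$group), function(g) {
  m <- outer(g$x, g$y, "+")
  dimnames(m) <- list(paste0("x=", g$x), paste0("y=", g$y))
  m
})

print(sums$A)
```

Each row of a matrix corresponds to the N rows that the grouped expression returns for one x.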
Output:
x group V1
<int> <char> <int>
1: 68 A 144
2: 68 A 107
3: 68 A 92
4: 68 A 121
5: 68 A 160
---
4996: 4 B 25
4997: 4 B 66
4998: 4 B 83
4999: 4 B 27
5000: 4 B 68
You can also accomplish this via a join:
DT[,!c("y")][DT[, .(y,group)], on=.(group), allow.cartesian=TRUE][, total:=x+y][order(ind)]
Output:
ind x group y total
<int> <int> <char> <int> <int>
1: 1 68 A 76 144
2: 1 68 A 39 107
3: 1 68 A 24 92
4: 1 68 A 53 121
5: 1 68 A 92 160
---
4996: 100 4 B 21 25
4997: 100 4 B 62 66
4998: 100 4 B 79 83
4999: 100 4 B 23 27
5000: 100 4 B 64 68

If I understand correctly, the requested result requires a cross join where each element of x is combined with each element of y (within each group).
This can be accomplished easily using the CJ() function:
DT[, CJ(x, y, sorted = FALSE), by = group][, sum_x_y := x + y][]
group x y sum_x_y
1: A 68 76 144
2: A 68 39 107
3: A 68 24 92
4: A 68 53 121
5: A 68 92 160
---
4996: B 4 21 25
4997: B 4 62 66
4998: B 4 79 83
4999: B 4 23 27
5000: B 4 64 68
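Since the question ultimately asks only for the lowest sum per x within each group, the full cross join may not be needed. Adding a fixed x to every y does not change which y is smallest, so the lowest sum for a given x is x plus the minimum y of its group. A base R sketch under that reading of the question (the column name min_sum is hypothetical):

```r
set.seed(1)
# Same layout as the question's DT, built as a plain data frame
df <- data.frame(
  ind   = 1:100,
  x     = sample(100),
  y     = sample(100),
  group = rep(c("A", "B"), each = 50)
)

# min over y of (x + y) equals x + min(y), computed per group
df$min_sum <- df$x + ave(df$y, df$group, FUN = min)
head(df)

# The data.table spelling would be: DT[, min_sum := x + min(y), by = group]
```

This keeps the result at one row per x instead of N rows per x.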

How to use Logical indexing and min function to find the row which has min value?

So, I know how to find it using the subset function. Is there any way not to use subset function?
Example dataset:
Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75
So, let's say I need to write code to find the row with the min value in column B.
My code: subset(df, B == min(B))
My question:
How can I use logical indexing and the min function for this dataset? I don't want to use subset.

You can use which to find the positions.
x <- c(2,1,3,1)
which(x == min(x))
#[1] 2 4
To get only the first hit, which.min can be used.
which.min(x)
#[1] 2
With the given data set.
x <- read.table(header=TRUE, text="Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75")
which(x$B == min(x$B))
#[1] 2
which(x[2:3] == min(x[2:3]), arr.ind = TRUE)
# row col
#[1,] 2 1
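Since the question asks for logical indexing specifically, the comparison itself can also be used directly as the row index, without which or subset. A short sketch, assuming the data frame is named df as in the question:

```r
df <- read.table(header = TRUE, text = "Month A B
J 67 89
F 48 69
M 78 89
A 54 90
M 54 75")

# Logical vector as row index: keeps every row where B equals its minimum
min_rows <- df[df$B == min(df$B), ]
print(min_rows)

# Only the first such row, via its position
first_min <- df[which.min(df$B), ]
print(first_min)
```

The logical form returns all tied rows, while which.min returns only the first.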

Subtracting a column from another column of the next row in R

I want to calculate c = b - a and d = b2 - a3... (that is, b minus the a of the next row). How can I do that?
name a b c=b-a d=b2-a3…
peter 80 100 20 30
dancy 70 90 20 20
tiger 70 85 15 20
pop 85 101 16 29
rock 72 111 39
Thank you so much!

Presuming your data is in a dataframe, to recreate your column using a tidyverse approach you could do:
library(tidyverse)
yourdata <- yourdata %>%
  mutate(c = b - a,
         d = b - lead(a))
To do the opposite, you can use lag. To increase the number of steps in either lag or lead, pass n, as in lag(column_name, n = number_of_steps).

Here is an option with data.table
library(data.table)
setDT(df1)[, c := b - a][, d := b - shift(a, type = 'lead')]
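Both answers rely on a "lead" operation, which shifts a column up by one row. A base R sketch of the same idea without extra packages, assuming the a and b values from the question's table:

```r
# a and b values as shown in the question's table
df1 <- data.frame(
  name = c("peter", "dancy", "tiger", "pop", "rock"),
  a    = c(80, 70, 70, 85, 72),
  b    = c(100, 90, 85, 101, 111)
)

df1$c <- df1$b - df1$a

# "Lead" by one row: drop the first a and pad the end with NA
next_a <- c(df1$a[-1], NA)
df1$d  <- df1$b - next_a

print(df1)
```

The last row has no following row, so its d is NA. For the third row this gives 85 - 85 = 0, not the 20 listed in the question's table, so that table entry may be a typo.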

Calculate from one cell in first row

This is probably an easy fix but I'm struggling with something easy in Excel but not so obvious in R. Have done a fair bit of Googling and searching here but no success.
My data (df) looks like this:
Amount Amount
1 25 100
2 33 ?
3 18 ?
4 27 ?
What I want to do is put a formula in cell B3 which is B2 + A3.
Whilst this is simple to do in r, where I'm struggling is to put B2 as a base figure and then to work from there. Therefore, B3 would be 133, B4 151 etc.
Appreciate any help.

Assuming you have a dataframe like this
df <- data.frame(Amount = c(25,33,18, 27))
df
# Amount
#1 25
#2 33
#3 18
#4 27
and you want to start your cumulative sum from 100, you could do
df$cumsumAmount <- cumsum(c(100, df$Amount[2:nrow(df)]))
df
# Amount cumsumAmount
#1 25 100
#2 33 133
#3 18 151
#4 27 178
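An equivalent way to write this, as a sketch, is to add the base figure to a cumulative sum of the remaining amounts. Negative indexing (df$Amount[-1]) also behaves sensibly for a one-row data frame, where 2:nrow(df) would count backwards. The name base_value is hypothetical:

```r
df <- data.frame(Amount = c(25, 33, 18, 27))

base_value <- 100  # hypothetical name for the starting figure in B2

# First row is the base itself; each later row adds its own Amount
df$cumsumAmount <- base_value + cumsum(c(0, df$Amount[-1]))
print(df)
```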

multiplying columns in R

I have a data frame like this.
> abc
ID 1.x 2.x 1.y 2.y
1 4 10 20 30 40
2 16 5 10 5 10
3 42 16 17 18 19
4 91 20 20 20 20
5 103 103 42 56 84
How do I create two additional columns '1' and '2' by multiplying 1.x * 1.y and 2.x * 2.y in a generalized way?
I am trying to get a generalized solution where the number of columns can be very large, so I want to multiply each x column with its corresponding y column. While the x and y suffixes are fixed, n has to be figured out from the data frame.
For simplicity, let's assume n is also fixed, but it is a large number.
One thing I can try is:
abc[,c(6,7)]=abc[,c(2,3)]*abc[,c(4,5)]
It will work only if the column positions are contiguous. This is good enough for me, but if anyone has a more generalized solution, it will benefit us all.

If there are only a couple of variables to multiply, we can do this with transform by multiplying the columns of interest
transform(abc, new1 = `1.x`*`1.y`, new2 = `2.x`*`2.y`, check.names = FALSE)
# ID 1.x 2.x 1.y 2.y new1 new2
#1 4 10 20 30 40 300 800
#2 16 5 10 5 10 25 100
#3 42 16 17 18 19 288 323
#4 91 20 20 20 20 400 400
#5 103 103 42 56 84 5768 3528
If we have lots of columns, then one approach is to split the dataset into a list of data.frames by the common prefix of the column names, then loop through the list and multiply the two columns in each piece with do.call
abc[paste0("new", 1:2)] <- lapply(
  split.default(abc[-1], sub("\\.[a-z]+$", "", names(abc)[-1])),
  function(x) do.call(`*`, x))
Or another option is (based on the pairwise column multiplication)
apply(aperm(array(unlist(abc[-1]), c(5, 2, 2)),
c(3, 1, 2)), 3, matrixStats::colProds)
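Another way to avoid depending on column positions is to pair the columns by name. The sketch below assumes every n.x column has a matching n.y column and checks that assumption before multiplying. The helper names x_cols, y_cols and new_names are hypothetical:

```r
# Data from the question; check.names = FALSE keeps names like "1.x"
abc <- data.frame(
  ID    = c(4, 16, 42, 91, 103),
  `1.x` = c(10, 5, 16, 20, 103),
  `2.x` = c(20, 10, 17, 20, 42),
  `1.y` = c(30, 5, 18, 20, 56),
  `2.y` = c(40, 10, 19, 20, 84),
  check.names = FALSE
)

# Find all ".x" columns and derive the matching ".y" names
x_cols <- grep("\\.x$", names(abc), value = TRUE)
y_cols <- sub("\\.x$", ".y", x_cols)
stopifnot(all(y_cols %in% names(abc)))  # assumption: each x has a y

# New columns are named by the shared prefix: "1", "2", ...
new_names <- sub("\\.x$", "", x_cols)
abc[new_names] <- abc[x_cols] * abc[y_cols]
print(abc)
```

Because the pairs are matched by name, this works regardless of where the x and y columns sit in the data frame.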

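mutate will preserve the original variables and add the new ones. mutate_all will apply one operation to all columns in your dataframe (funs() is deprecated since dplyr 0.8.0, so a formula is used here).
library(dplyr)
abc %>%
  mutate(new_vary1 = `1.x` * `1.y`,
         new_vary2 = `2.x` * `2.y`) %>%
  mutate_all(~ . * `1.x`)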
