How to remove columns with duplicate values in a data frame in R?

I have the following data:
Years A B C D
2015 1 7 1 13
2016 2 8 2 14
2017 3 9 3 15
2018 4 10 4 16
2019 5 11 5 17
2020 6 12 6 18
I want the result to look as below (the columns with duplicate values removed):
Years A B D
2015 1 7 13
2016 2 8 14
2017 3 9 15
2018 4 10 16
2019 5 11 17
2020 6 12 18
Thanks in advance for all the help!

Combine unclass() and duplicated() to flag the columns that repeat an earlier column, then keep the rest:
df[!duplicated(unclass(df))]
output:
Years A B D
<dbl> <dbl> <dbl> <dbl>
1 2015 1 7 13
2 2016 2 8 14
3 2017 3 9 15
4 2018 4 10 16
5 2019 5 11 17
6 2020 6 12 18

Or we can transpose the dataset and apply duplicated(), which compares the rows of a matrix (here, the original columns):
df1[!duplicated(t(df1))]
# Years A B D
#1 2015 1 7 13
#2 2016 2 8 14
#3 2017 3 9 15
#4 2018 4 10 16
#5 2019 5 11 17
#6 2020 6 12 18
data
df1 <- structure(list(Years = 2015:2020, A = 1:6, B = 7:12, C = 1:6,
D = 13:18), class = "data.frame", row.names = c(NA, -6L))

If you want something fast, try this approach:
df[!duplicated(as.list(df))]
# Years A B D
# 1 2015 1 7 13
# 2 2016 2 8 14
# 3 2017 3 9 15
# 4 2018 4 10 16
# 5 2019 5 11 17
# 6 2020 6 12 18

Related

Convert dataframe from vertical to horizontal

I already checked many questions and I don't seem to find the suitable answer.
I have this df
df = data.frame(x = 1:10,y=11:20)
the output
x y
1 1 11
2 2 12
3 3 13
4 4 14
5 5 15
6 6 16
7 7 17
8 8 18
9 9 19
10 10 20
I just wish the output to be:
1 2 3 4 5 6 7 8 9 10
x 1 2 3 4 5 6 7 8 9 10
y 11 12 13 14 15 16 17 18 19 20
thanks
Try t(), as below:
data.frame(t(df), check.names = FALSE)
1 2 3 4 5 6 7 8 9 10
x 1 2 3 4 5 6 7 8 9 10
y 11 12 13 14 15 16 17 18 19 20
A transpose should do it
setNames(data.frame(t(df)), df[,"x"])
1 2 3 4 5 6 7 8 9 10
x 1 2 3 4 5 6 7 8 9 10
y 11 12 13 14 15 16 17 18 19 20
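Two things are worth knowing about the t() approach: it returns a matrix rather than a data frame, and it coerces all columns to one common type (character, if any column is non-numeric). A small sketch on the question's data, where m and wide are illustrative names:

```r
df <- data.frame(x = 1:10, y = 11:20)

# t() converts the data frame to a matrix; the old column names become row names
m <- t(df)
print(is.matrix(m))

# Convert back to a data frame and label the columns with the values of x
wide <- setNames(data.frame(m), df$x)
print(wide)
```

Because both columns are integers here, nothing is lost in the round trip; with mixed column types the values would come back as character.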

How to update data from column i row 2 to column j row 1 but grouped by two variables (dplyr) in a R dataframe?

I have two columns: sites (3 sites) and month (Jan - Mar), with one sample per site in each month. For each month I have a corresponding value in column i. Within each site, I want to copy column i, row 2 to column j, row 1, then column i, row 3 to column j, row 2, and finally column i, row 1 to column j, row 3, and repeat this pattern for every site. So, if column i went from 1 to 18, column j would be 2 3 1 5 6 4 8 9 7 11 12 10 14 15 13 17 18 16. I tried to modify the code from an answer to a similar problem using dplyr. I tried to use the group_by function in dplyr so that the pattern would restart for each site, but the function is operating on the entire column.
library(dplyr)
col.site <- c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
col.month <- c("Jan","Feb","Mar","Jan","Feb","Mar","Jan","Feb","Mar","Jan","Feb","Mar","Jan","Feb","Mar","Jan","Feb","Mar")
col.i <- c(1:18)
df <- data.frame(col.site,col.month, col.i)
df <- df %>% group_by(col.month,col.site) %>%
mutate(col.j = lead(col.i, default = col.i[1]))
col.j
[1] 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 1
What I expected col.j:
[1] 2 3 1 5 6 4 8 9 7 11 12 10 14 15 13 17 18 16
I think you should group by col.site only:
library(dplyr)
df %>%
group_by(col.site) %>%
mutate(col.j = lead(col.i, default = first(col.i)))
# col.site col.month col.i col.j
# <dbl> <chr> <int> <int>
# 1 1 Jan 1 2
# 2 1 Feb 2 3
# 3 1 Mar 3 1
# 4 2 Jan 4 5
# 5 2 Feb 5 6
# 6 2 Mar 6 4
# 7 3 Jan 7 8
# 8 3 Feb 8 9
# 9 3 Mar 9 7
#10 4 Jan 10 11
#11 4 Feb 11 12
#12 4 Mar 12 10
#13 5 Jan 13 14
#14 5 Feb 14 15
#15 5 Mar 15 13
#16 6 Jan 16 17
#17 6 Feb 17 18
#18 6 Mar 18 16
Using data.table
library(data.table)
setDT(df)[, col.j := shift(col.i, type = 'lead', fill = first(col.i)), col.site]
Or using dplyr
library(dplyr)
df %>%
group_by(col.site) %>%
mutate(col.j = c(col.i[-1], col.i[1]))
Output:
# col.site col.month col.i col.j
# <dbl> <chr> <int> <int>
# 1 1 Jan 1 2
# 2 1 Feb 2 3
# 3 1 Mar 3 1
# 4 2 Jan 4 5
# 5 2 Feb 5 6
# 6 2 Mar 6 4
# 7 3 Jan 7 8
# 8 3 Feb 8 9
# 9 3 Mar 9 7
#10 4 Jan 10 11
#11 4 Feb 11 12
#12 4 Mar 12 10
#13 5 Jan 13 14
#14 5 Feb 14 15
#15 5 Mar 15 13
#16 6 Jan 16 17
#17 6 Feb 17 18
#18 6 Mar 18 16
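If you would rather avoid extra packages, the same per-site rotation can be sketched in base R with ave(). This is an alternative to the answers above, not a replacement for them, and rotate_left is a made-up helper name:

```r
col.site  <- rep(1:6, each = 3)
col.month <- rep(c("Jan", "Feb", "Mar"), times = 6)
col.i     <- 1:18
df <- data.frame(col.site, col.month, col.i)

# Rotate a vector one step to the left: drop the first element, append it at the end
rotate_left <- function(v) c(v[-1], v[1])

# ave() applies the function within each site and returns the results
# in the original row order
df$col.j <- ave(df$col.i, df$col.site, FUN = rotate_left)
print(df$col.j)
```

The grouping variable is the whole point: grouping by both month and site leaves one row per group, so there is nothing to rotate.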

SMA for multiple items in the same column

I'm trying to compute a simple moving average (SMA) for multiple items in the same column. Here's an example of the data I'm working with.
Person Time Value
<chr> <dbl> <dbl>
1 A 1 14
2 A 2 13
3 A 3 17
4 A 4 9
5 A 5 20
6 A 6 5
7 B 1 17
8 B 2 11
9 B 3 18
10 B 4 10
11 B 5 10
12 B 6 20
13 C 1 5
14 C 2 5
15 C 3 11
16 C 4 12
17 C 5 12
18 C 6 9
What I'd like to do is create another column with the SMA for each person (A, B, C, etc.), in this case a two-period SMA (SMA2). While it works for Person A, I can't get the calculation to restart at Person B. Instead, Person B's first SMA2 value is computed using Person A's last value.
Right now I've used this which does give me the SMA I want, just not restarted at each new person:
DataSet$SMA2<-SMA(DataSet$Value, 2)
Any help would be appreciated.
DataSet <- DataSet %>%
group_by(Person) %>%
mutate(sma2 = TTR::SMA(Value,2))
I tried grouping by Person with the code above, but still came up with this (the values do not restart at each person):
# A tibble: 18 x 4
# Groups: Person [3]
Person Time Value sma2
<chr> <dbl> <dbl> <dbl>
1 A 1 14 NA
2 A 2 13 13.5
3 A 3 17 15
4 A 4 9 13
5 A 5 20 14.5
6 A 6 5 12.5
7 B 1 17 11
8 B 2 11 14
9 B 3 18 14.5
10 B 4 10 14
11 B 5 10 10
12 B 6 20 15
13 C 1 5 12.5
14 C 2 5 5
15 C 3 11 8
16 C 4 12 11.5
17 C 5 12 12
18 C 6 9 10.5
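Using dplyr, group by Person and then call mutate(); the calculation then restarts for each person. If the result still runs across persons, check whether another package (for example plyr, loaded after dplyr) is masking mutate(); calling dplyr::mutate() explicitly avoids that.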
DataSet <- DataSet %>%
group_by(Person) %>%
mutate(sma2 = TTR::SMA(Value, 2))
# A tibble: 18 x 4
# Groups: Person [3]
Person Time Value sma2
<chr> <int> <int> <dbl>
1 A 1 14 NA
2 A 2 13 13.5
3 A 3 17 15
4 A 4 9 13
5 A 5 20 14.5
6 A 6 5 12.5
7 B 1 17 NA
8 B 2 11 14
9 B 3 18 14.5
10 B 4 10 14
11 B 5 10 10
12 B 6 20 15
13 C 1 5 NA
14 C 2 5 5
15 C 3 11 8
16 C 4 12 11.5
17 C 5 12 12
18 C 6 9 10.5
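As a cross-check that does not depend on dplyr or TTR, here is a rough base R sketch of the same per-person calculation. It swaps in stats::filter() for TTR::SMA(), and sma is a hypothetical helper written for this example:

```r
DataSet <- data.frame(
  Person = rep(c("A", "B", "C"), each = 6),
  Time   = rep(1:6, times = 3),
  Value  = c(14, 13, 17, 9, 20, 5,
             17, 11, 18, 10, 10, 20,
             5, 5, 11, 12, 12, 9)
)

# Trailing moving average over n points; the first n - 1 results are NA
sma <- function(v, n = 2) as.numeric(stats::filter(v, rep(1 / n, n), sides = 1))

# ave() runs sma() separately for each person, so the window never
# reaches back into the previous person's rows
DataSet$sma2 <- ave(DataSet$Value, DataSet$Person, FUN = sma)
print(DataSet)
```

The first row of every person comes out as NA, which is the sign that the window restarted.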

How to "extrapolate" values of panel data in R?

I have panel data with NA values, as shown below:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 NA
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 NA
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
I would like to perform a linear interpolation, so I wrote this code:
library(dplyr)
library(zoo)
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.approx(value, na.rm=FALSE))
then I get the output:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
Here the approx method interpolates the NA values successfully but does not extrapolate.
Is there a good way to replace the values in the 1st and 2nd rows with the first non-NA value of this user (30)? Similarly, how can I replace the values in the 9th and 10th rows with the last non-NA value of this user (50)?
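One way to fill the leading and trailing NAs is na.spline() from the same zoo package. Note that it extrapolates with a spline rather than repeating the first or last non-NA value, so the ends come out as 40, 35 and 55, 60 instead of 30 and 50: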
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.spline(value))
panel_df
Source: local data frame [10 x 5]
Groups: uid [2]
uid year month day value
<int> <int> <int> <int> <dbl>
1 1 2016 8 1 40
2 1 2016 8 2 35
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 55
10 2 2016 8 5 60
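If the goal is literally to repeat the first and last non-NA value at the ends, as the question asks, base R's approx() can do that with rule = 2. Below is a minimal sketch under the assumption that every uid has at least two non-NA values; fill_ends is a made-up helper name. As far as I know, zoo's na.approx() passes extra arguments on to approx(), so na.approx(value, na.rm = FALSE, rule = 2) inside the question's mutate() should behave the same way, but only the base version is shown here.

```r
panel_df <- data.frame(
  uid   = rep(1:2, each = 5),
  year  = 2016,
  month = 8,
  day   = rep(1:5, times = 2),
  value = c(NA, NA, 30, NA, 20, 40, NA, 50, NA, NA)
)

fill_ends <- function(v) {
  idx <- seq_along(v)
  ok  <- !is.na(v)
  # Linear interpolation between observed points;
  # rule = 2 uses the nearest observed value outside the observed range
  approx(idx[ok], v[ok], xout = idx, rule = 2)$y
}

# Apply the fill separately for each uid
panel_df$value <- ave(panel_df$value, panel_df$uid, FUN = fill_ends)
print(panel_df)
```

Interior gaps are still interpolated linearly (25 and 45), while the ends are held flat at 30 and 50.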

divide dataframe into subgroups based on several columns successively in R

I have to sort a datapool with the following structure into subgroups based on the values of 3 columns in R, but I cannot figure it out.
What I want to do is:
First, sort the datapool by column V1 in descending order and divide it into three subgroups according to the value of V1.
Then sort each of the 3 subgroups into another 3 subgroups according to the value of V2, giving 9 subgroups.
Similarly, subdivide each of the 9 groups into 3 groups again according to V3, resulting in 27 subgroups altogether.
The following data is only a simple example; the real data has 1545 firms.
Firm value V1 V2 V3
1 7 7 11 8
2 9 9 11 7
3 8 14 8 10
4 9 9 7 14
5 8 11 15 14
6 9 10 9 7
7 8 8 6 14
8 4 8 11 14
9 8 10 13 10
10 2 11 6 13
11 3 5 12 14
12 5 12 15 12
13 1 9 13 7
14 4 5 14 7
15 5 10 5 9
16 5 8 13 14
17 2 10 10 7
18 5 12 12 9
19 7 6 11 7
20 6 9 14 14
21 6 14 9 14
22 8 6 6 7
23 9 11 9 5
24 7 7 6 9
25 10 5 15 11
26 4 6 10 9
27 4 13 14 8
And the result should be:
Firm value V1 V2 V3
5 8 11 15 14
12 5 12 15 12
27 4 13 14 8
21 6 14 9 14
18 5 12 12 9
23 9 11 9 5
10 2 11 6 13
3 8 14 8 10
6 9 10 9 7
20 6 9 14 14
9 8 10 13 10
13 1 9 13 7
8 4 8 11 14
2 9 9 11 7
17 2 10 10 7
4 9 9 7 14
7 8 8 6 14
15 5 10 5 9
16 5 8 13 14
25 10 5 15 11
14 4 5 14 7
11 3 5 12 14
1 7 7 11 8
19 7 6 11 7
26 4 6 10 9
24 7 7 6 9
22 8 6 6 7
I have tried for a long time, also searched Google without success. :(
As @Codoremifa said, data.table can be used here:
require(data.table)
DT <- data.table(dat)
DT[order(V1),G1:=rep(1:3,each=9)]
DT[order(V2),G2:=rep(1:3,each=3),by=G1]
DT[order(V3),G3:=1:3,by='G1,G2']
Now your groups are labeled using the additional columns G1, G2 and G3. To sort, so that it's easier to see the groups, use
setkey(DT,G1,G2,G3)
A couple of the OP's columns are just noise unrelated to the question; to verify that this works by eye, try DT[,list(V1,V2,V3,G1,G2,G3)]
EDIT: The OP did not specify a means of dealing with ties. I guess it makes sense to use the value in the later columns to break ties, so...
DT <- data.table(dat)
DT[order(rank(V1)+rank(V2)/100+rank(V3)/100^2),
G1:=rep(1:3,each=9)]
DT[order(rank(V2)+rank(V3)/100),
G2:=rep(1:3,each=3),by=G1]
DT[order(V3),
G3:=1:3,by='G1,G2']
setkey(DT,G1,G2,G3)
DT[27:1] (the result backwards) is
Firm value V1 V2 V3 G1 G2 G3
1: 5 8 11 15 14 3 3 3
2: 12 5 12 15 12 3 3 2
3: 27 4 13 14 8 3 3 1
4: 21 6 14 9 14 3 2 3
5: 9 8 10 13 10 3 2 2
6: 18 5 12 12 9 3 2 1
7: 10 2 11 6 13 3 1 3
8: 3 8 14 8 10 3 1 2
9: 23 9 11 9 5 3 1 1
10: 20 6 9 14 14 2 3 3
11: 16 5 8 13 14 2 3 2
12: 13 1 9 13 7 2 3 1
13: 8 4 8 11 14 2 2 3
14: 17 2 10 10 7 2 2 2
15: 2 9 9 11 7 2 2 1
16: 4 9 9 7 14 2 1 3
17: 15 5 10 5 9 2 1 2
18: 6 9 10 9 7 2 1 1
19: 11 3 5 12 14 1 3 3
20: 25 10 5 15 11 1 3 2
21: 14 4 5 14 7 1 3 1
22: 26 4 6 10 9 1 2 3
23: 1 7 7 11 8 1 2 2
24: 19 7 6 11 7 1 2 1
25: 7 8 8 6 14 1 1 3
26: 24 7 7 6 9 1 1 2
27: 22 8 6 6 7 1 1 1
Firm value V1 V2 V3 G1 G2 G3
Here is an answer using transform and then ddply from plyr. I don't address the ties, which really means that in case of a tie the value from the lowest row number is used first. This is what the OP shows in the example output.
First, order the dataset in descending order of V1 and create three groups of 9 by creating a new variable, fv1.
dat1 = transform(dat1[order(-dat1$V1),], fv1 = factor(rep(1:3, each = 9)))
Then order the dataset in descending order of V2 and create three groups of 3 within each level of fv1.
require(plyr)
dat1 = ddply(dat1[order(-dat1$V2),], .(fv1), transform, fv2 = factor(rep(1:3, each = 3)))
Finally, order the dataset by the two factors and V3. I use arrange from plyr because it takes less typing than order.
(finaldat = arrange(dat1, fv1, fv2, -V3) )
This isn't a particularly generalizable answer, as the group sizes are known in advance for the factors. If the V3 group size were larger than one, a process similar to the one for V2 would be needed.
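To avoid hard-coding the group sizes, one option is to rank within each group and cut the ranks into thirds. The following base R sketch does that with ave(); tercile is a hypothetical helper, ties are broken by original row order (an assumption, since the question does not say how to handle them), and group 3 holds the largest values. With group sizes that are not divisible by 3 the bins would be only approximately equal.

```r
dat <- data.frame(
  Firm = 1:27,
  V1 = c(7, 9, 14, 9, 11, 10, 8, 8, 10, 11, 5, 12, 9, 5, 10, 8, 10, 12,
         6, 9, 14, 6, 11, 7, 5, 6, 13),
  V2 = c(11, 11, 8, 7, 15, 9, 6, 11, 13, 6, 12, 15, 13, 14, 5, 13, 10, 12,
         11, 14, 9, 6, 9, 6, 15, 10, 14),
  V3 = c(8, 7, 10, 14, 14, 7, 14, 14, 10, 13, 14, 12, 7, 7, 9, 14, 7, 9,
         7, 14, 14, 7, 5, 9, 11, 9, 8)
)

# Split a vector into 3 equal-sized bins by rank (1 = lowest third)
tercile <- function(v) {
  r <- rank(v, ties.method = "first")
  ceiling(3 * r / length(v))
}

dat$G1 <- tercile(dat$V1)                             # 3 groups of 9
dat$G2 <- ave(dat$V2, dat$G1, FUN = tercile)          # 3 groups of 3 within each G1
dat$G3 <- ave(dat$V3, dat$G1, dat$G2, FUN = tercile)  # 1 firm per G1/G2/G3 cell

# Show the firms from the highest group to the lowest
sorted <- dat[order(-dat$G1, -dat$G2, -dat$G3), ]
print(sorted)
```

Because the bins are derived from length(v), the same three lines should carry over to the full 1545-firm data without changing any constants.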
