R: create a new column with the rank of factors

In R I have a dataframe with two columns: one is a value and the other is the group that each value is assigned to:
my_group my_value
A 1.2
B 5.4
C 9.2
A 1.1
B 5.2
C 9.8
A 1.3
B 5.1
C 9.2
A 1.0
B 5.7
C 9.1
I want to create a third column that uses the average of my_value by group to rank the groups and enters that rank in each row:
my_group my_value my_group_rank
A 1.2 3
B 5.4 2
C 9.2 1
A 1.1 3
B 5.2 2
C 9.8 1
A 1.3 3
B 5.1 2
C 9.2 1
A 1.0 3
B 5.7 2
C 9.1 1
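For reference, the example data can be constructed like this (a minimal sketch matching the table above; the answer below assumes it is stored in a data.frame called test):
test <- data.frame(
  my_group = rep(c("A", "B", "C"), times = 4),
  my_value = c(1.2, 5.4, 9.2,
               1.1, 5.2, 9.8,
               1.3, 5.1, 9.2,
               1.0, 5.7, 9.1)
)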

The following code will add the group ranks to your data, except that the ranks will be in the opposite order; perhaps you can still use it. I use the dplyr package for this. In my example, I assume your data is in a data.frame called test.
require(dplyr)
test <- test %>%
  group_by(my_group) %>%
  mutate(avg = mean(my_value)) %>%
  ungroup() %>%
  mutate(my_group_rank = dense_rank(avg)) %>%
  select(-avg)
# my_group my_value my_group_rank
#1 A 1.2 1
#2 B 5.4 2
#3 C 9.2 3
#4 A 1.1 1
#5 B 5.2 2
#6 C 9.8 3
#7 A 1.3 1
#8 B 5.1 2
#9 C 9.2 3
#10 A 1.0 1
#11 B 5.7 2
#12 C 9.1 3
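To get the ranks in exactly the order asked for (highest group average = rank 1), a small variant of the same pipeline wraps the average in desc():
require(dplyr)
test <- test %>%
  group_by(my_group) %>%
  mutate(avg = mean(my_value)) %>%
  ungroup() %>%
  mutate(my_group_rank = dense_rank(desc(avg))) %>%
  select(-avg)
This gives A the rank 3, B the rank 2, and C the rank 1, matching the desired output above.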

Related

Extract rows with duplicated values in one column only if corresponding values in another column are also duplicated in r

I am trying to extract rows ONLY IF they have duplicated values in the first and second columns (x1 and x2). In other words, extracting duplicated rows in the first column (x1) ONLY IF the corresponding rows in the second column (x2) are all duplicates.
dt
x1 x2  x3
 1  a 2.1
 1  a 3.4
 1  b 4
 2  c 5.5
 2  c 4.1
 2  d 5
 3  e 2.4
 3  e 7
 4  f 1.5
 4  f 4.4
 4  f 2.1
 5  g 7.8
I tried to use:
dupe = dt[,c('x1','x2')]
dt[duplicated(dupe) | duplicated(dupe, fromLast=TRUE),]
However, the results are different from what I want. My attempt returns:
x1 x2  x3
 1  a 2.1
 1  a 3.4
 2  c 5.5
 2  c 4.1
 3  e 2.4
 3  e 7
 4  f 1.5
 4  f 4.4
 4  f 2.1
My DESIRED database should EXCLUDE x1 = 1 (its corresponding x2 values are a, a, b, so they are not ALL duplicates) and likewise x1 = 2 (its x2 values are c, c, d). It should include only the following:
x1 x2  x3
 3  e 2.4
 3  e 7
 4  f 1.5
 4  f 4.4
 4  f 2.1
Any solutions please?
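For reference, the example data can be rebuilt like this (a sketch matching the table above):
dt <- data.frame(
  x1 = c(1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 5),
  x2 = c("a", "a", "b", "c", "c", "d", "e", "e", "f", "f", "f", "g"),
  x3 = c(2.1, 3.4, 4, 5.5, 4.1, 5, 2.4, 7, 1.5, 4.4, 2.1, 7.8)
)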
Note that calling duplicated() on the pair of columns just reproduces your attempt: it keeps every (x1, x2) pair that occurs more than once, so the x1 = 1 and x1 = 2 groups survive. What you want is a per-group test: keep an x1 group only if it has more than one row and all of its x2 values are identical. With dplyr:
library(dplyr)
dt %>%
  group_by(x1) %>%
  filter(n() > 1, n_distinct(x2) == 1) %>%
  ungroup()
x1 x2  x3
 3  e 2.4
 3  e 7.0
 4  f 1.5
 4  f 4.4
 4  f 2.1
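A base R sketch of the same per-group test, assuming dt as rebuilt above, using ave():
# group size per x1, and number of distinct x2 values per x1
n_rows <- ave(seq_along(dt$x1), dt$x1, FUN = length)
n_vals <- ave(as.integer(factor(dt$x2)), dt$x1, FUN = function(z) length(unique(z)))
dt[n_rows > 1 & n_vals == 1, ]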

Adding a column that contains the missing values of a specific column of a tibble in R

I am working with R. I have the values that are missing from a specific column of my dataset and I need to add them into my main data.
My data looks like this...
A B C D G
Joseph 5 2.1 6.0 7.8
Juan NA 3.0 3.5 3.8
Miguel 2 4.0 2.0 2.5
Steven NA 6.0 5.0 0.2
Jennifer NA 0.1 5.0 7.0
Emma 8.0 8.1 8.3 8.5
So now I have the data for the missing values in column B:
A B
Juan 3.0
Steven 2.5
Jennifer 4.4
I need to add them into my main data. I tried to use the coalesce function from the tidyverse, but I wasn't able to get the right result.
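For reproducibility, here is one way to build the example data (a sketch matching the tables above; the answers below refer to the main table as df, df1, or data, and the lookup table as df2, dd, or additional_data):
df <- data.frame(
  A = c("Joseph", "Juan", "Miguel", "Steven", "Jennifer", "Emma"),
  B = c(5, NA, 2, NA, NA, 8),
  C = c(2.1, 3, 4, 6, 0.1, 8.1),
  D = c(6, 3.5, 2, 5, 5, 8.3),
  G = c(7.8, 3.8, 2.5, 0.2, 7, 8.5)
)
df2 <- data.frame(
  A = c("Juan", "Steven", "Jennifer"),
  B = c(3, 2.5, 4.4)
)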
One option could be:
df %>%
  mutate(B = if_else(is.na(B), df2$B[match(A, df2$A)], B))
A B C D G
1 Joseph 5.0 2.1 6.0 7.8
2 Juan 3.0 3.0 3.5 3.8
3 Miguel 2.0 4.0 2.0 2.5
4 Steven 2.5 6.0 5.0 0.2
5 Jennifer 4.4 0.1 5.0 7.0
6 Emma 8.0 8.1 8.3 8.5
Does this work?
df
# A tibble: 6 x 5
A B C D G
<chr> <dbl> <dbl> <dbl> <dbl>
1 Joseph 5 2.1 6 7.8
2 Juan NA 3 3.5 3.8
3 Miguel 2 4 2 2.5
4 Steven NA 6 5 0.2
5 Jennifer NA 0.1 5 7
6 Emma 8 8.1 8.3 8.5
dd
# A tibble: 3 x 2
A B
<chr> <dbl>
1 Juan 3
2 Steven 2.5
3 Jennifer 4.4
df$B[match(dd$A,df$A)] <- dd$B
df
# A tibble: 6 x 5
A B C D G
<chr> <dbl> <dbl> <dbl> <dbl>
1 Joseph 5 2.1 6 7.8
2 Juan 3 3 3.5 3.8
3 Miguel 2 4 2 2.5
4 Steven 2.5 6 5 0.2
5 Jennifer 4.4 0.1 5 7
6 Emma 8 8.1 8.3 8.5
You can join the two dataframes and use coalesce for the B values.
library(dplyr)
df1 %>%
  left_join(df2, by = 'A') %>%
  mutate(B = coalesce(B.x, B.y)) %>%
  select(names(df1))
# A B C D G
#1 Joseph 5.0 2.1 6.0 7.8
#2 Juan 3.0 3.0 3.5 3.8
#3 Miguel 2.0 4.0 2.0 2.5
#4 Steven 2.5 6.0 5.0 0.2
#5 Jennifer 4.4 0.1 5.0 7.0
#6 Emma 8.0 8.1 8.3 8.5
Or in base R:
transform(merge(df1, df2, all.x = TRUE, by = 'A'),
          B = ifelse(is.na(B.x), B.y, B.x))[names(df1)]
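If you are on dplyr 1.0.0 or later, rows_patch() does this kind of NA-filling in a single call; a minimal sketch, assuming df and df2 as constructed above:
library(dplyr)
# rows_patch() overwrites only the NA values in df with the matching values from df2
rows_patch(df, df2, by = "A")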
You can join the data and then fill in the NA values in column B.
# your original data with missing values in column B
data
# data that contains the values to fill into column B
additional_data
library(dplyr)
merged_data <- left_join(data, additional_data, by = "A",
                         suffix = c("", "_additional"))
merged_data %>%
  mutate(B = if_else(is.na(B), B_additional, B)) %>%
  select(-B_additional)

Pandas equivalent of dplyr everything()

In R I frequently use dplyr's select in combination with everything()
df %>% select(var4, var17, everything())
The above, for example, would reorder the columns of the dataframe such that var4 is first, var17 is second, and all remaining columns follow. What is the most pandathonic way of doing this? With many columns, explicitly spelling them all out is a pain, as is keeping track of their positions.
The ideal solution is short, readable and can be used in pandas chaining.
Use Index.difference to get all columns not in the specified list, then join the two together:
import pandas as pd

df = pd.DataFrame({
    'G': list('abcdef'),
    'var17': [4, 5, 4, 5, 5, 4],
    'A': [7, 8, 9, 4, 2, 3],
    'var4': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})

cols = ['var4', 'var17']
another = df.columns.difference(cols, sort=False).tolist()
df = df[cols + another]
print(df)
var4 var17 G A E F
0 1 4 a 7 5 a
1 3 5 b 8 3 a
2 5 4 c 9 6 a
3 7 5 d 4 9 b
4 1 5 e 2 2 b
5 0 4 f 3 4 b
EDIT: For chaining, you can use DataFrame.pipe, which passes the DataFrame on to a function:
def everything_after(df, cols):
    another = df.columns.difference(cols, sort=False).tolist()
    return df[cols + another]

df = df.pipe(everything_after, ['var4', 'var17'])
print(df)
var4 var17 G A E F
0 1 4 a 7 5 a
1 3 5 b 8 3 a
2 5 4 c 9 6 a
3 7 5 d 4 9 b
4 1 5 e 2 2 b
5 0 4 f 3 4 b
Now look how smoothly you can do it with datar!
>>> from datar import f
>>> from datar.datasets import iris
>>> from datar.dplyr import select, everything, slice_head
>>> iris >> slice_head(5)
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
>>> iris >> select(f.Species, everything()) >> slice_head(5)
Species Sepal_Length Sepal_Width Petal_Length Petal_Width
0 setosa 5.1 3.5 1.4 0.2
1 setosa 4.9 3.0 1.4 0.2
2 setosa 4.7 3.2 1.3 0.2
3 setosa 4.6 3.1 1.5 0.2
4 setosa 5.0 3.6 1.4 0.2
I am the author of the package. Feel free to submit issues if you have any questions.

lagging variables by day and creating new row in the process

I'm trying to lag variables by day, but many don't have an observation on the previous day, so I need to add an extra row in the process. dplyr gets me close, but I need a way to add the new rows, and I have many thousands of cases. Any thoughts would be much appreciated.
ID<-c(1,1,1,1,2,2)
day<-c(0,1,2,5,1,3)
v<-c(2.2,3.4,1.2,.8,6.4,2)
dat1<-as.data.frame(cbind(ID,day,v))
dat1
ID day v
1 1 0 2.2
2 1 1 3.4
3 1 2 1.2
4 1 5 0.8
5 2 1 6.4
6 2 3 2.0
Using dplyr gets me here:
dat2 <- dat1 %>%
  group_by(ID) %>%
  mutate(v.L = dplyr::lead(v, n = 1, default = NA))
dat2
ID day v v.L
1 1 0 2.2 3.4
2 1 1 3.4 1.2
3 1 2 1.2 0.8
4 1 5 0.8 NA
5 2 1 6.4 2.0
6 2 3 2.0 NA
But I need to get here:
ID2<-c(1,1,1,1,1,2,2,2)
day2<-c(0,1,2,4,5,1,2,3)
v2<-c(2.2,3.4,1.2,NA,.8,6.4,NA,2)
v2.L<-c(3.4,1.2,NA,.8,NA,NA,2,NA)
dat3<-as.data.frame(cbind(ID2,day2,v2,v2.L))
dat3
ID2 day2 v2 v2.L
1 1 0 2.2 3.4
2 1 1 3.4 1.2
3 1 2 1.2 NA
4 1 4 NA 0.8
5 1 5 0.8 NA
6 2 1 6.4 NA
7 2 2 NA 2.0
8 2 3 2.0 NA
You could use complete() and full_seq() from the tidyr package to fill in the missing days in each ID's sequence. At the end you'd need to remove the rows that have NA in both v and v.L:
library(dplyr)
library(tidyr)
dat2 = dat1 %>%
  group_by(ID) %>%
  complete(day = full_seq(day, 1)) %>%
  mutate(v.L = lead(v)) %>%
  filter(!(is.na(v) & is.na(v.L)))
ID day v v.L
<dbl> <dbl> <dbl> <dbl>
1 0 2.2 3.4
1 1 3.4 1.2
1 2 1.2 NA
1 4 NA 0.8
1 5 0.8 NA
2 1 6.4 NA
2 2 NA 2.0
2 3 2.0 NA
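The same idea works as a data.table sketch (assuming dat1 as defined in the question): build the full per-ID grid of days, join onto it, lead within each ID, then drop the rows that are NA in both columns:
library(data.table)
setDT(dat1)
# complete sequence of days for each ID
grid <- dat1[, .(day = seq(min(day), max(day))), by = ID]
out <- dat1[grid, on = .(ID, day)]  # right join onto the grid fills missing days with NA
out[, v.L := shift(v, type = "lead"), by = ID]
out <- out[!(is.na(v) & is.na(v.L))]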

R - Count duplicated rows keeping index of their first occurrences

I have been looking for an efficient way of counting and removing duplicate rows in a data frame while keeping the index of their first occurrences.
For example, if I have a data frame:
library(plyr)
df <- data.frame(x = c(9.3, 5.1, 0.6, 0.6, 8.5, 1.3, 1.3, 10.8),
                 y = c(2.4, 7.1, 4.2, 4.2, 3.2, 8.1, 8.1, 5.9))
ddply(df, names(df), nrow)
gives me
x y V1
1 0.6 4.2 2
2 1.3 8.1 2
3 5.1 7.1 1
4 8.5 3.2 1
5 9.3 2.4 1
6 10.8 5.9 1
But I want to keep the original indices (along with the row names) of the duplicated rows. like:
x y V1
1 9.3 2.4 1
2 5.1 7.1 1
3 0.6 4.2 2
5 8.5 3.2 1
6 1.3 8.1 2
8 10.8 5.9 1
"duplicated" returns the original rownames (here {1 2 3 5 6 8}) but doesnt count the number of occurences. I tried writing functions on my own but none of them are efficient enough to handle big data. My data frame can have up to couple of million rows (though columns are usually 5 to 10).
If you want to keep the index:
library(data.table)
setDT(df)[,.(.I, .N), by = names(df)][!duplicated(df)]
# x y I N
#1: 9.3 2.4 1 1
#2: 5.1 7.1 2 1
#3: 0.6 4.2 3 2
#4: 8.5 3.2 5 1
#5: 1.3 8.1 6 2
#6: 10.8 5.9 8 1
Or using data.tables unique method
unique(setDT(df)[,.(.I, .N), by = names(df)], by = names(df))
We can try with data.table. We convert the 'data.frame' to a 'data.table' (setDT(df)) and, grouped by the 'x' and 'y' columns, get the number of rows per group (.N).
library(data.table)
setDT(df)[, list(V1=.N), by = .(x,y)]
# x y V1
#1: 9.3 2.4 1
#2: 5.1 7.1 1
#3: 0.6 4.2 2
#4: 8.5 3.2 1
#5: 1.3 8.1 2
#6: 10.8 5.9 1
If we need the row ids,
setDT(df)[, list(V1= .N, rn=.I[1L]), by = .(x,y)]
# x y V1 rn
#1: 9.3 2.4 1 1
#2: 5.1 7.1 1 2
#3: 0.6 4.2 2 3
#4: 8.5 3.2 1 5
#5: 1.3 8.1 2 6
#6: 10.8 5.9 1 8
Or
setDT(df, keep.rownames=TRUE)[, list(V1=.N, rn[1L]), .(x,y)]
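For comparison, a dplyr sketch that keeps the first-occurrence index in the same way, by recording row_number() before grouping:
library(dplyr)
df %>%
  mutate(rn = row_number()) %>%
  group_by(x, y) %>%
  summarise(V1 = n(), rn = first(rn), .groups = "drop") %>%
  arrange(rn)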
