Rearranging data frame columns in R (mutate, dplyr)

I have a data frame like so
Type Number Species
A 1 G
A 2 R
A 7 Q
A 4 L
B 4 S
B 5 T
B 3 H
B 9 P
C 12 K
C 11 T
C 6 U
C 5 Q
I have already applied group_by(Type).
My goal is to collapse this data so that Number holds the top 2 values of the Number column within each Type, and a new column (Number_2) holds the bottom 2 values.
I also want the Species values that belong to the bottom two numbers to be dropped, so that Species corresponds to the higher number in each row.
I would like to use dplyr, and the final result would look like this:
Type Number Number_2 Species
A 7 1 Q
A 4 2 L
B 5 3 T
B 9 4 P
C 12 6 K
C 11 5 T
For now the order of Number_2 doesn't matter, as long as the values stay within the same Type.
I don't know if this is possible, but if it is, does anyone know how?
Thanks!

You can try
library(data.table)
setDT(df1)[order(-Number), list(Number1=Number[1:2],
Number2=Number[3:4],
Species=Species[1:2]), keyby = Type]
# Type Number1 Number2 Species
#1: A 7 2 Q
#2: A 4 1 L
#3: B 9 4 P
#4: B 5 3 T
#5: C 12 6 K
#6: C 11 5 T
Or using dplyr with do
library(dplyr)
df1 %>%
group_by(Type) %>%
arrange(desc(Number)) %>%
do(data.frame(Type=.$Type[1L],
Number1=.$Number[1:2],
Number2 = .$Number[3:4],
Species=.$Species[1:2], stringsAsFactors=FALSE))
# Type Number1 Number2 Species
#1 A 7 2 Q
#2 A 4 1 L
#3 B 9 4 P
#4 B 5 3 T
#5 C 12 6 K
#6 C 11 5 T
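Since do() is superseded in recent dplyr releases, the same reshaping can be sketched with lead() and slice() alone. This assumes every Type has exactly four rows, as in the question; the data frame is rebuilt here so the snippet runs on its own, and the column name Number2 simply follows the output above.

```r
library(dplyr)

# Example data from the question, rebuilt so the snippet is self-contained
df1 <- data.frame(
  Type    = rep(c("A", "B", "C"), each = 4),
  Number  = c(1, 2, 7, 4, 4, 5, 3, 9, 12, 11, 6, 5),
  Species = c("G", "R", "Q", "L", "S", "T", "H", "P", "K", "T", "U", "Q"),
  stringsAsFactors = FALSE
)

out <- df1 %>%
  arrange(Type, desc(Number)) %>%        # largest Number first within each Type
  group_by(Type) %>%
  mutate(Number2 = lead(Number, 2)) %>%  # pair rank 1 with rank 3, rank 2 with rank 4
  slice(1:2) %>%                         # keep the two rows with the largest Number
  ungroup()

out
```

With groups of a different size, lead(Number, 2) would no longer line up with the bottom two values, so the offset would need adjusting.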

Here's a different dplyr approach.
library(dplyr)
# Start creating the data set with top 2 values and store as df1:
df1 <- df %>%
group_by(Type) %>%
top_n(2, Number) %>%
ungroup() %>%
arrange(Type, Number)
# Then, get the anti-joined data (the rows not in the top 2), arrange, rename and select
# the number column, and cbind to df1:
out <- df %>%
anti_join(df1, c("Type","Number")) %>%
arrange(Type, Number) %>%
select(Number2 = Number) %>%
cbind(df1, .)
This results in:
> out
# Type Number Species Number2
#1 A 4 L 1
#2 A 7 Q 2
#3 B 5 T 3
#4 B 9 P 4
#5 C 11 T 5
#6 C 12 K 6

This could be another option using ddply
library(plyr)
# Sort by Number first; within each Type the rows are then in ascending order
ddply(dat[order(dat$Number), ], .(Type), summarize,
Number1 = Number[4:3], Number2 = Number[2:1], Species = Species[4:3])
# Type Number1 Number2 Species
#1 A 7 2 Q
#2 A 4 1 L
#3 B 9 4 P
#4 B 5 3 T
#5 C 12 6 K
#6 C 11 5 T

Related

How to remove all rows from dataframe if count of similar `person_id` values are not `== 2`

I need to remove all rows from a dataframe if the count of similar person_id values is not == 2. For example:
a1 <- data.frame(person_id = 1:5, b=letters[1:5])
a2 <- data.frame(person_id = 2:6, b=letters[6:10])
data = rbind(a1, a2)
person_id b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e
6 2 f
7 3 g
8 4 h
9 5 i
10 6 j
Rows 1 and 10 must be removed, because person_id==1 and person_id==6 have only 1 record each. By contrast, person_id==2 has 2 rows.
How can I get a new dataset with only the rows whose person_id occurs exactly 2 times (and, in future, 3 or 4 times)?
Base R solution:
subset(
data,
ave(person_id, person_id, FUN = length) == 2
)
To remove the rows where count of person_id isn't equal to 2:
library(dplyr)
data %>%
group_by(person_id) %>%
filter(n() == 2)
person_id b
<int> <chr>
1 2 b
2 3 c
3 4 d
4 5 e
5 2 f
6 3 g
7 4 h
8 5 i
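The question also mentions needing counts of 3 or 4 later. One way to parameterise the base R filter is a small helper; keep_ids_with_count is a hypothetical name used only for illustration, and the sketch assumes the id column is called person_id as in the example.

```r
# Example data from the question
a1 <- data.frame(person_id = 1:5, b = letters[1:5], stringsAsFactors = FALSE)
a2 <- data.frame(person_id = 2:6, b = letters[6:10], stringsAsFactors = FALSE)
data <- rbind(a1, a2)

# Hypothetical helper: keep rows whose person_id occurs exactly k times
keep_ids_with_count <- function(df, k) {
  # ave() returns, for every row, the number of rows sharing its person_id
  n_per_id <- ave(seq_along(df$person_id), df$person_id, FUN = length)
  df[n_per_id == k, , drop = FALSE]
}

kept <- keep_ids_with_count(data, 2)
kept
```

Calling keep_ids_with_count(data, 3) would then cover the later case without touching the filter itself.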

R slide window through tibble

I have a simple question that I cannot figure out a solution for.
I also couldn't find an existing answer that I understand.
Imagine I have this data frame:
(ts <- tibble(
a = LETTERS[1:10],
b = c(rep(1, 5), rep(2,5))
))
# A tibble: 10 x 2
a b
<chr> <dbl>
1 A 1
2 B 1
3 C 1
4 D 1
5 E 1
6 F 2
7 G 2
8 H 2
9 I 2
10 J 2
What I want is simple: a data frame in which column b indexes a sliding window of size n over column a.
The output can be something like this:
# A tibble: 8 x 2
b a
<dbl> <chr>
1 1 A B
2 1 B C
3 1 C D
4 1 D E
5 2 F G
6 2 G H
7 2 H I
8 2 I J
I don't care if column a ends up containing an array (nested values).
I just need a new data frame based on the sliding window.
Since this operation will run in a relational database, I'd like a function compatible with DBI and PostgreSQL.
Any help is appreciated.
Thanks in advance
We can group by 'b', create the new column from the lead of 'a', and remove the NA rows with na.omit
library(dplyr)
ts %>%
group_by(b) %>%
mutate(a2 = lead(a)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
# A tibble: 8 x 3
# b a a2
# <dbl> <chr> <chr>
#1 1 A B
#2 1 B C
#3 1 C D
#4 1 D E
#5 2 F G
#6 2 G H
#7 2 H I
#8 2 I J
If lead doesn't work, then just remove the first element and append NA at the end in the mutate step
ts %>%
group_by(b) %>%
mutate(a2 = c(a[-1], NA)) %>%
ungroup %>%
na.omit %>%
select(b, everything())
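Both snippets above cover a window of size 2. For a general window size n, a base R sketch could look like the following; sliding_windows is a hypothetical helper name, and this version runs in memory on a plain data frame, so it is not a database-side solution.

```r
# Same data as in the question, as a plain data frame
ts <- data.frame(
  a = LETTERS[1:10],
  b = rep(c(1, 2), each = 5),
  stringsAsFactors = FALSE
)

# Hypothetical helper: all consecutive windows of length n, pasted into one string each
sliding_windows <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts,
         function(i) paste(x[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Build the windows separately within each value of b
pieces <- lapply(split(ts$a, ts$b), sliding_windows, n = 2)

out <- data.frame(
  b = rep(as.numeric(names(pieces)), lengths(pieces)),
  a = unlist(pieces, use.names = FALSE),
  stringsAsFactors = FALSE
)
out
```

Changing n = 2 to n = 3 would give three windows per group ("A B C", "B C D", "C D E", and so on).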

Converting a row of data into a data frame in R

I have a single row data frame like this:
X1 X2 X3
1 [['1','2','3'], ['4','6','5'], ['7','8']] ['9','10','11','12','13']
I would like create a new dataframe from that using columns X2 and X3 that looks like this:
ID Group
1 A
2 A
3 A
4 B
5 B
6 B
7 C
8 C
9 D
10 D
11 D
12 D
13 D
So each number in the dataframe is grouped by the square brackets in the original dataframe.
Can anyone recommend a good way of doing this in R?
One option would be to split 'X2' at each , that follows a ], concatenate the pieces with 'X3', extract the numeric elements into a list with str_extract_all, and stack that list into a two-column data.frame
library(stringr)
v1 <- c(strsplit(df1$X2, "\\],\\s*")[[1]], df1$X3)
out <- stack(setNames(str_extract_all(v1, "\\d+"), LETTERS[1:4]))
names(out) <- c("ID", "Group")
out
# ID Group
#1 1 A
#2 2 A
#3 3 A
#4 4 B
#5 6 B
#6 5 B
#7 7 C
#8 8 C
#9 9 D
#10 10 D
#11 11 D
#12 12 D
#13 13 D
Or using tidyverse
library(dplyr)
library(tidyr)
library(stringr) # needed for str_extract_all
df1 %>%
pivot_longer(cols = -X1) %>%
separate_rows(value, sep="(?<=\\]),\\s*") %>%
transmute(Group = LETTERS[row_number()], ID = value) %>%
mutate(ID = str_extract_all(ID, "\\d+")) %>%
unnest(c(ID))
# A tibble: 13 x 2
# Group ID
# <chr> <chr>
# 1 A 1
# 2 A 2
# 3 A 3
# 4 B 4
# 5 B 6
# 6 B 5
# 7 C 7
# 8 C 8
# 9 D 9
#10 D 10
#11 D 11
#12 D 12
#13 D 13
data
df1 <- structure(list(X1 = 1L, X2 = "[['1','2','3'], ['4','6','5'], ['7','8']]",
X3 = "['9','10','11','12','13']"), class = "data.frame", row.names = c(NA,
-1L))
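If stringr is not available, the same idea can be sketched in base R with regmatches() and gregexpr(). The column names ID and Group follow the question; the IDs are kept in their original within-bracket order (4, 6, 5), as in the outputs above.

```r
df1 <- data.frame(
  X1 = 1L,
  X2 = "[['1','2','3'], ['4','6','5'], ['7','8']]",
  X3 = "['9','10','11','12','13']",
  stringsAsFactors = FALSE
)

# One string per bracketed group: split X2 after each "]," and append X3
chunks <- c(strsplit(df1$X2, "\\],\\s*")[[1]], df1$X3)

# Pull out the runs of digits from each chunk (a list with one element per group)
ids <- regmatches(chunks, gregexpr("[0-9]+", chunks))

out <- data.frame(
  ID    = as.integer(unlist(ids)),
  Group = rep(LETTERS[seq_along(ids)], lengths(ids)),
  stringsAsFactors = FALSE
)
out
```

Sorting with out[order(out$Group, out$ID), ] would reproduce the exact row order shown in the question.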

Matching values from multiple columns in 1 data frame to key in second data frame and creating columns

I have 2 data frames. One (df1) looks like this:
var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA
And the other (df2) looks like this:
var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z
...
where every value that appears in var.1-var.4 of df1 is listed in var.a of df2.
I want to match var.a from df2 against each of the columns in df1 and then add the matched df2 columns to df1 with new, combined column names. For instance, it would look like this:
var.1 var1.b var1.c var1.d ... var.4 var4.b var4.c var4.d
1 7 j k z 2 f g h
2 4 j k l 7 j k z
3 2 f g h NA NA NA NA
Thanks in advance!
Here's a tidyverse solution. First, I define the data frames.
df1 <- read.table(text = " var.1 var.2 var.3 var.4
1 7 9 1 2
2 4 6 9 7
3 2 NA NA NA", header = TRUE)
df2 <- read.table(text = " var.a var.b var.c var.d
1 1 b c d
2 2 f g h
3 4 j k l
4 7 j k z", header=TRUE)
Then, I load the libraries.
# Load libraries
library(tidyr)
library(dplyr)
library(tibble)
Finally, I restructure the data.
# Manipulate data
df1 %>%
rownames_to_column() %>%
gather(variable, value, -rowname) %>%
left_join(df2, by = c("value" = "var.a")) %>%
gather(foo, bar, -variable, -rowname) %>%
unite(goop, variable, foo) %>%
spread(goop, bar) %>%
select(-rowname)
#> Warning: attributes are not identical across measure variables;
#> they will be dropped
which gives,
#> var.1_value var.1_var.b var.1_var.c var.1_var.d var.2_value var.2_var.b
#> 1 7 j k z 9 <NA>
#> 2 4 j k l 6 <NA>
#> 3 2 f g h <NA> <NA>
#> var.2_var.c var.2_var.d var.3_value var.3_var.b var.3_var.c var.3_var.d
#> 1 <NA> <NA> 1 b c d
#> 2 <NA> <NA> 9 <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA>
#> var.4_value var.4_var.b var.4_var.c var.4_var.d
#> 1 2 f g h
#> 2 7 j k z
#> 3 <NA> <NA> <NA> <NA>
Created on 2019-05-30 by the reprex package (v0.3.0)
This is a little bit convoluted, but I'll try to explain.
I turn row numbers into a column at first, as this will help me put the data back together at the very end.
I go from wide to long format for df1.
I join df2 to df1 by matching var.a against the stacked var.1-var.4 values (now called value).
I go from wide to long again.
I combine the variable names from each data frame into one variable.
Finally, I go from long to wide format (this is where the row numbers come in handy) and drop the row numbers.
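A shorter route to the column layout asked for in the question is to look up each df1 column in df2 with match() and bind the results side by side. This is only a sketch: lookup_block is a hypothetical helper, and the naming rule (var.1 becomes var1.b, var1.c, var1.d) is taken from the desired output in the question.

```r
df1 <- data.frame(var.1 = c(7, 4, 2), var.2 = c(9, 6, NA),
                  var.3 = c(1, 9, NA), var.4 = c(2, 7, NA))
df2 <- data.frame(var.a = c(1, 2, 4, 7),
                  var.b = c("b", "f", "j", "j"),
                  var.c = c("c", "g", "k", "k"),
                  var.d = c("d", "h", "l", "z"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: one df1 column plus its matched df2 columns
lookup_block <- function(col) {
  idx <- match(df1[[col]], df2$var.a)          # NA where there is no match
  block <- df2[idx, c("var.b", "var.c", "var.d")]
  rownames(block) <- NULL
  names(block) <- paste0(sub("\\.", "", col), ".", c("b", "c", "d"))
  cbind(df1[col], block)
}

out <- do.call(cbind, lapply(names(df1), lookup_block))
out
```

Values with no entry in df2 (9 and 6 in this small example) simply produce NA in the looked-up columns.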

get proportion of each group's value counts after group_by dplyr R

In Python pandas I was able to simply do df.groupby(x,y).value_counts(normalize=True) to get the proportion of each value in a group. However, I've been unable to find a way to do this in R.
I've grouped my df by x and y and summarized to calculate the frequency, like so: df %>% group_by(x,y) %>% summarize(count=n()).
Instead, I'd like to see the proportion of each y within each x.
x y count
1 A 22
1 B 65
1 C 94
1 D 40
2 D 34
2 E 1
2 F 6
3 E 4
3 F 13
for example, the new column of proportions should have
x y proportion
1 A 0.0995475
1 B 0.2941176
1 C 0.4253393
1 D 0.1809955
2 D 0.8292683
2 E 0.0243902
2 F 0.1463415
3 E 0.2352941
3 F 0.7647059
I think you need to group by x to get the results in your example. Assuming the data frame is named df1:
library(dplyr)
df1 %>%
group_by(x) %>%
mutate(proportion = count/sum(count))
If we need a base R option, this can be done with ave
transform(df1, proportion = count/ave(count, x, FUN = sum))[-3]
# x y proportion
#1 1 A 0.09954751
#2 1 B 0.29411765
#3 1 C 0.42533937
#4 1 D 0.18099548
#5 2 D 0.82926829
#6 2 E 0.02439024
#7 2 F 0.14634146
#8 3 E 0.23529412
#9 3 F 0.76470588
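Both answers start from the already summarised counts. Starting from the raw rows instead, the counting and the proportion can be chained in one dplyr pipeline. In this sketch the raw data frame is reconstructed from the counts in the question purely for illustration; raw, counts and res are names chosen here, not ones used by the asker.

```r
library(dplyr)

# Counts from the question, expanded back into one row per observation
counts <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 3, 3),
  y = c("A", "B", "C", "D", "D", "E", "F", "E", "F"),
  count = c(22, 65, 94, 40, 34, 1, 6, 4, 13),
  stringsAsFactors = FALSE
)
raw <- counts[rep(seq_len(nrow(counts)), counts$count), c("x", "y")]

res <- raw %>%
  count(x, y, name = "count") %>%              # frequency of each (x, y) pair
  group_by(x) %>%
  mutate(proportion = count / sum(count)) %>%  # share of each y within its x
  ungroup()

res
```

Grouping by x only in the second step is what makes the proportions sum to 1 within each x rather than within each (x, y) pair.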
