Spread (tidyr) - Spreading repeated values - r

Given this data:
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4)
y <- c('Name', 'Street', 'Gender', 'Name', 'Street', 'Name', 'Street', 'Street', 'Dateofbirth', 'Gender','Name')
z <- c('Jasper', 'Broadway', 'Male', 'Alice', 'Narrowstreet', 'Peter', 'Neverland', 'Treasureisland', '1841', 'Male','Martin')
k <- data.frame(id = x, key = y, value = z)
I would like to create a clean 4-column table that has the keys as headers (i.e. Name, Street, Gender and Date of birth). The problem here is that the key 'Street' appears twice for Peter. I've tried to use spread (tidyr), but I haven't managed to make it work.
k <- k %>% group_by(id) %>%
mutate(index = row_number()) %>%
spread(key, value)
I also tried:
k <- k %>% group_by(id) %>%
mutate(index = row_number()) %>%
spread(id, value)
The result is not what I was expecting and both tables are quite difficult to work with. Any ideas?

Don't know if this is exactly what you are looking for, but if you just want to keep the first, you can group_by(id,key) and summarise value using first. Then, regroup by id and spread:
library(dplyr)
library(tidyr)
k <- k %>% group_by(id, key) %>% summarise(value=first(value)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 5]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street
##* <dbl> <fctr> <fctr> <fctr> <fctr>
##1 1 NA Male Jasper Broadway
##2 2 NA NA Alice Narrowstreet
##3 3 1841 Male Peter Neverland
##4 4 NA NA Martin NA
To put the doubled values in separate columns, use make.names to create unique keys:
k <- k %>% group_by(id) %>% mutate(key=make.names(key,unique=TRUE)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 6]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street Street.1
##* <dbl> <fctr> <fctr> <fctr> <fctr> <fctr>
##1 1 NA Male Jasper Broadway NA
##2 2 NA NA Alice Narrowstreet NA
##3 3 1841 Male Peter Neverland Treasureisland
##4 4 NA NA Martin NA NA
Alternatively, you can group_by(id,key) and summarise value using toString or paste with collapse to flatten the doubled values:
k <- k %>% group_by(id, key) %>% summarise(value=toString(value)) %>% group_by(id) %>% spread(key,value)
##Source: local data frame [4 x 5]
##Groups: id [4]
##
## id Dateofbirth Gender Name Street
##* <dbl> <chr> <chr> <chr> <chr>
##1 1 <NA> Male Jasper Broadway
##2 2 <NA> <NA> Alice Narrowstreet
##3 3 1841 Male Peter Neverland, Treasureisland
##4 4 <NA> <NA> Martin <NA>
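In recent tidyr releases spread() is superseded by pivot_wider(), which can deal with duplicated keys through its values_fn argument. The following is a minimal sketch of the toString approach above, assuming a tidyr version in which values_fn accepts a single function; the name wide is only illustrative.

```r
library(tidyr)

# Same data as in the question; strings kept as character rather than factor
k <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 4),
  key   = c("Name", "Street", "Gender", "Name", "Street", "Name", "Street",
            "Street", "Dateofbirth", "Gender", "Name"),
  value = c("Jasper", "Broadway", "Male", "Alice", "Narrowstreet", "Peter",
            "Neverland", "Treasureisland", "1841", "Male", "Martin"),
  stringsAsFactors = FALSE
)

# values_fn is applied to the values of every id/key cell, so the two
# streets of id 3 are collapsed into one comma-separated string
wide <- pivot_wider(k, names_from = key, values_from = value,
                    values_fn = toString)
wide
```

Because the result is stored under a new name, k itself is left untouched and the other variants can still be run on it.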

Related

Extract values from another data frame if the value of one column is a partial match in R

Sorry, I didn't clarify my question.
My aim is: if dt$id %in% df$id, extract df$score and add it to a new column in dt.
I have a data frame like this:
df <- tibble(
score = c(2587,002,885,901,2587,3371,3372,002),
id = c("AR01.0","AR01.1","AR01.12","ERS02.00","ERS02.01","ERS02.02","QR01","QR01.03"))
And I have another data frame like this:
dt <- tibble(
id = c("AR01","QR01","KVC"),
city = c("AM", "Bis","CHB"))
I want to mutate a new column "score" and get output like this:
id city score
AR01 AM 2587/2/885
ERS02 Bis 901/3371
KVC CHB NA
or
id city score score2 score3
AR01 AM 2587 2 885
ERS02 Bis 901 3371 NA
KVC CHB NA NA NA
I tried to use ifelse to achieve this but always got an error.
Can anyone provide ideas? Thank you.
A simple left_join (after mutating the id values in df) is all that is required:
library(dplyr)
library(stringr)
left_join(df %>% mutate(id = str_extract(id, "[\\w]+")), dt, by = "id") %>%
group_by(id) %>%
summarise(across(city,first),
score = paste(score, collapse = "/"))
# A tibble: 3 × 3
id city score
<chr> <chr> <chr>
1 AR01 AM 2587/2/885
2 ERS02 NA 901/2587/3371
3 QR01 Bis 3372/2
For the second solution you can use separate:
library(dplyr)
library(stringr)
library(tidyr)
left_join(df %>% mutate(id = str_extract(id, "[\\w]+")), dt, by = "id") %>%
group_by(id) %>%
summarise(across(city,first),
score = paste(score, collapse = "/")) %>%
separate(score,
into = paste("score", 1:3),
sep = "/" )
# A tibble: 3 × 5
id city `score 1` `score 2` `score 3`
<chr> <chr> <chr> <chr> <chr>
1 AR01 AM 2587 2 885
2 ERS02 NA 901 2587 3371
3 QR01 Bis 3372 2 NA
You could create groups by extracting everything before the . with sub, group_by that prefix, merge the scores of each group with paste (separated by /), and then right_join the result to dt by id, like this:
library(tibble)
df <- tibble(
score = c(2587,002,885,901,2587,3371,3372,002),
id = c("AR01.0","AR01.1","AR01.12","ERS02.00","ERS02.01","ERS02.02","QR01","QR01.03"))
dt <- tibble(
id = c("AR01","QR01","KVC"),
city = c("AM", "Bis","CHB"))
library(dplyr)
df %>%
mutate(id = sub('\\..*', "", id)) %>%
group_by(id) %>%
mutate(score = paste(score, collapse = '/')) %>%
distinct(id, .keep_all = TRUE) %>%
ungroup() %>%
right_join(., dt, by = 'id')
#> # A tibble: 3 × 3
#> score id city
#> <chr> <chr> <chr>
#> 1 2587/2/885 AR01 AM
#> 2 3372/2 QR01 Bis
#> 3 <NA> KVC CHB
Created on 2022-10-01 with reprex v2.0.2
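The question asks for a new column in dt, so every row of dt should survive even when nothing matches. A minimal sketch of that variant, assuming the part of df$id before the first dot is the join key: summarise the scores per prefix first, then left_join starting from dt. The names scores and result are illustrative.

```r
library(dplyr)

df <- data.frame(
  score = c(2587, 2, 885, 901, 2587, 3371, 3372, 2),
  id    = c("AR01.0", "AR01.1", "AR01.12", "ERS02.00", "ERS02.01",
            "ERS02.02", "QR01", "QR01.03"),
  stringsAsFactors = FALSE
)
dt <- data.frame(
  id   = c("AR01", "QR01", "KVC"),
  city = c("AM", "Bis", "CHB"),
  stringsAsFactors = FALSE
)

# One row per id prefix, with all of its scores pasted together
scores <- df %>%
  mutate(id = sub("\\..*", "", id)) %>%   # "AR01.12" -> "AR01"
  group_by(id) %>%
  summarise(score = paste(score, collapse = "/"), .groups = "drop")

# Starting from dt keeps unmatched ids such as KVC, with score = NA
result <- left_join(dt, scores, by = "id")
result
```

Prefixes that occur only in df (here ERS02) are dropped by this join direction, which is the opposite trade-off to joining from df.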

is there an R code for the following data wrangling and transformation

I have the following data set
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007,001,002,003,004,005,006,007,008,009,010,011,012,013,014,015,016,017,018,019,020,021,022,023,024,025,026,027,028,029)
dat1<-data.frame(id,s02)
I would like to create a new data set based on dat1. I would like R code that automatically creates n s02 columns named s02__0, s02__1, s02__2, s02__3, s02__4 (so here n == 5). Then, based on the id in dat1, the code should allocate each s02 value to the respective column s02__0 to s02__4. The rows are uniquely identified by a second ID, id_2, which numbers the rows within each id. If an id has fewer s02 values than there are columns in a row, the remaining cells should be filled with ##N/A##. If an id has more than n s02 values, a new row with the next id_2 is created to hold the extra values, and every blank cell is again filled with ##N/A##.
From the data set above, I would like to get the following output:
id<-c(1,2,3,3,4,4,4,4,4,4)
id_2<-c(1,1,1,2,1,2,3,4,5,6)
s02__0<-c(1,1,1,6,1,6,11,16,21,26)
s02__1<-c(2,2,2,7,2,7,12,17,22,27)
s02__2<-c(3,3,3,##N/A##,3,8,13,18,23,28)
s02__3<-c(4,4,4,##N/A##,4,9,14,19,24,29)
s02__4<-c(##N/A##,5,5,##N/A##,5,10,15,20,25,##N/A##)
dat2<-data.frame(id,id_2,s02__0,s02__1,s02__2,s02__3,s02__4)
This can produce what you want:
library(tidyverse)
#Data
id<-c(1,1,1,1,2,2,2,2,2,3,3,3,3,3,3,3)
s02<-c(001,002,003,004,001,002,003,004,005,001,002,003,004,005,006,007)
dat1<-data.frame(id,s02)
#Code
dat2 <- dat1 %>% group_by(id) %>% mutate(id2 = ifelse(s02<=5,1,2)) %>% ungroup() %>%
group_by(id,id2) %>% mutate(val=1:n()-1,nid = cur_group_id()) %>% ungroup() %>%
select(-id2) %>% mutate(id=paste0(id,'.',nid),val=paste0('s02','.',val)) %>% select(-nid) %>%
pivot_wider(names_from = c(val),values_from = s02) %>%
mutate(id=gsub("\\..*","", id)) %>% group_by(id) %>%
mutate(id2=1:n()) %>% select(order(colnames(.)))
dat2
# A tibble: 4 x 7
# Groups: id [3]
id id2 s02.0 s02.1 s02.2 s02.3 s02.4
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 2 3 4 NA
2 2 1 1 2 3 4 5
3 3 1 1 2 3 4 5
4 3 2 6 7 NA NA NA
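The ifelse(s02 <= 5, 1, 2) step above only covers ids with at most ten values. A more general sketch, assuming the chunk size n is known up front, derives both the row index and the column suffix from the position of each value within its id, using integer division and modulo. The helper columns pos and col are illustrative names, and empty cells come out as NA rather than the literal ##N/A##.

```r
library(dplyr)
library(tidyr)

n <- 5  # number of s02 columns per output row

# Full data from the question: ids 1 to 4 with 4, 5, 7 and 29 values
dat1 <- data.frame(
  id  = rep(1:4, times = c(4, 5, 7, 29)),
  s02 = c(1:4, 1:5, 1:7, 1:29)
)

dat2 <- dat1 %>%
  group_by(id) %>%
  mutate(
    pos  = row_number() - 1,            # 0-based position within the id
    id_2 = pos %/% n + 1,               # which output row the value lands in
    col  = paste0("s02__", pos %% n)    # which output column it lands in
  ) %>%
  ungroup() %>%
  select(-pos) %>%
  pivot_wider(names_from = col, values_from = s02)
dat2
```

If the ##N/A## marker is really needed, the s02__ columns would have to be converted to character first, since a numeric column cannot hold that string.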

Count string length using external table

Suppose you have a table of data:
df<-tibble(person = c("Alice", "Bob", "Mary"),
colour = c("Red", "Green", "Blue"),
city = c("London", "Paris", "New York"))
# A tibble: 3 x 3
person colour city
<chr> <chr> <chr>
1 Alice Red London
2 Bob Green Paris
3 Mary Blue New York
And a second table which contains the field names and the maximum string length of each field:
len<-tibble(field_name = c("person", "colour", "city"),
field_length = c(12, 4, 6))
# A tibble: 3 x 2
field_name field_length
<chr> <dbl>
1 person 12
2 colour 4
3 city 6
How can I check, for each field in len, whether a string in df is less than or equal to len$field_length, returning rows which fail the test?
As an example:
Output Row 1 in df would pass because:
'Alice' <= 12 characters long,
'Red' is <= 4 characters long and
'London' is <= 6 characters long.
However,
Row 2 would fail because:
'Green' > 4 characters long and
Row 3 would fail because:
'New York' > 6 characters long.
Thus the returned data frame should only display Rows 2 and Row 3 of the original df.
A dplyr solution with c_across():
library(dplyr)
df %>%
rowwise() %>%
filter(any(nchar(c_across(everything())) > len$field_length)) %>%
ungroup()
# # A tibble: 2 x 3
# person colour city
# <chr> <chr> <chr>
# 1 Bob Green Paris
# 2 Mary Blue New York
Using base R with mapply :
df[rowSums(mapply(function(x, y) nchar(x) > y, df, len$field_length)) > 0, ]
# A tibble: 2 x 3
# person colour city
# <chr> <chr> <chr>
#1 Bob Green Paris
#2 Mary Blue New York
If column names in df are not in the same order as len$field_name use df[len$field_name] in mapply.
In the tidyverse, we can reshape the data to long format, join it with the len data by column name, keep the rows which fail the test, and reshape the data to wide format again.
library(dplyr)
library(tidyr)
df %>%
mutate(row = row_number()) %>%
pivot_longer(-row) %>%
left_join(len, by = c('name' = 'field_name')) %>%
group_by(row) %>%
filter(any(nchar(value) > field_length)) %>%
dplyr::select(-field_length) %>%
pivot_wider()
It's easier to solve your problem in terms of 2 matrices, first the length of each of your entries:
nchar(as.matrix(df))
person colour city
[1,] 5 3 6
[2,] 3 5 5
[3,] 4 4 8
And a corresponding matrix of allowed length:
# replicate() returns one column per replicate, so transpose to get one row per row of df
allowed = t(replicate(nrow(df),len$field_length[match(colnames(df),len$field_name)]))
allowed
[,1] [,2] [,3]
[1,] 12 4 6
[2,] 12 4 6
[3,] 12 4 6
Then compare the two matrices element-wise and keep only the rows in which at least one entry exceeds its allowed length:
df[rowMeans(nchar(as.matrix(df)) > allowed)>0,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York
If your two data.frames are already in the same order, as in your example, you can do the following (thanks to @zx8754 for pointing it out):
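# a vector is recycled down the columns, so transpose first to line the fields up with len$field_length
df[colMeans(t(nchar(as.matrix(df))) > len$field_length)>0,]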
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York
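Recycling a vector against a matrix runs down the columns, so it is easy to end up comparing along the wrong axis when the data happens to be square. A sketch that makes the per-column comparison explicit with sweep(), and that selects the columns by len$field_name so their order in df does not matter. The fourth row ("Zed") is a hypothetical addition so that the data is no longer 3 x 3; lens, too_long and failing are illustrative names.

```r
df <- data.frame(
  person = c("Alice", "Bob", "Mary", "Zed"),      # "Zed" row is hypothetical
  colour = c("Red", "Green", "Blue", "Pink"),
  city   = c("London", "Paris", "New York", "Rome"),
  stringsAsFactors = FALSE
)
len <- data.frame(
  field_name   = c("person", "colour", "city"),
  field_length = c(12, 4, 6),
  stringsAsFactors = FALSE
)

# String lengths, with columns arranged in the order given by len
lens <- nchar(as.matrix(df[len$field_name]))

# MARGIN = 2 compares every column against its own allowed length
too_long <- sweep(lens, 2, len$field_length, FUN = ">")

# Keep the rows with at least one over-long entry
failing <- df[rowSums(too_long) > 0, ]
failing
```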
Pivot df into the same format as len and join the two. After this, it is trivial to compare each string to the field_length.
library(tidyverse)
test_result_df <- df %>%
mutate(id = row_number()) %>%
pivot_longer(-id, names_to = 'field_name') %>%
left_join(len, by = 'field_name') %>%
mutate(test_passed = str_length(value) <= field_length) %>%
group_by(id) %>%
summarise(all_passed = all(test_passed))
df[!test_result_df$all_passed,]
# A tibble: 2 x 3
person colour city
<chr> <chr> <chr>
1 Bob Green Paris
2 Mary Blue New York

How can I match two sets of factor levels in a new data frame?

I have a large data frame and I want to export a new data frame that contains summary statistics of the first based on the id column.
library(tidyverse)
set.seed(123)
id = rep(c(letters[1:5]), 2)
species = c("dog","dog","cat","cat","bird","bird","cat","cat","bee","bee")
study = rep("UK",10)
freq = rpois(10, lambda=12)
df1 <- data.frame(id,species, freq,study)
df1$id<-sort(df1$id)
df1
df2 <- df1 %>% group_by(id) %>%
summarise(meanFreq= mean(freq),minFreq=min(freq))
df2
I want to keep the species name in the new data frame with the summary statistics. But if I merge by id I get redundant rows. I should only have one row per id but with the species name appended.
df3<-merge(df2,df1,by = "id")
This is what it should look like but my real data is messier than this neat set up here:
df4 = df3[seq(1, nrow(df3), 2), ]
df4
From the summarised output ('df2'), we can join with the distinct rows of the selected columns of the original data:
library(dplyr)
df2 %>%
left_join(df1 %>%
distinct(id, species, study), by = 'id')
# A tibble: 5 x 5
# id meanFreq minFreq species study
# <fct> <dbl> <dbl> <fct> <fct>
#1 a 10.5 10 dog UK
#2 b 14.5 12 cat UK
#3 c 14.5 12 bird UK
#4 d 10 7 cat UK
#5 e 11 6 bee UK
Or use the same logic in base R:
merge(df2,unique(df1[c(1:2, 4)]),by = "id", all.x = TRUE)
Time for mutate followed by distinct:
df1 %>% group_by(id) %>%
mutate(meanFreq = mean(freq), minFreq = min(freq)) %>%
distinct(id, .keep_all = T)
Now actually there are two possibilities: either id and species are essentially the same in your df, one is just a label for the other, or the same id can have several species.
If the latter is the case, you will need to replace the last line with distinct(id, species, .keep_all = T).
This would get you:
# A tibble: 5 x 6
# Groups: id [5]
id species freq study meanFreq minFreq
<fct> <fct> <int> <fct> <dbl> <dbl>
1 a dog 10 UK 10.5 10
2 b cat 17 UK 14.5 12
3 c bird 12 UK 14.5 12
4 d cat 13 UK 10 7
5 e bee 6 UK 11 6
If your only goal is to keep the species & they are indeed the same as id, you could also just include it in the group_by:
df1 %>% group_by(id, species) %>%
summarise(meanFreq = mean(freq), minFreq = min(freq))
This would then remove study and freq - if you have the need to keep them, you can again replace summarise with mutate and then distinct with .keep_all = T argument.
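If each id is known to map to a single species, another option is to carry the label through summarise() itself with first(). A minimal sketch under that assumption; the freq values are fixed here instead of drawn with rpois(), so the numbers are only illustrative.

```r
library(dplyr)

df1 <- data.frame(
  id      = rep(letters[1:5], each = 2),
  species = c("dog", "dog", "cat", "cat", "bird", "bird", "cat", "cat",
              "bee", "bee"),
  freq    = c(10, 11, 17, 12, 12, 17, 13, 7, 6, 16),  # fixed, illustrative
  study   = "UK",
  stringsAsFactors = FALSE
)

# One row per id; first() picks the label from the first row of each group
df2 <- df1 %>%
  group_by(id) %>%
  summarise(
    species  = first(species),
    study    = first(study),
    meanFreq = mean(freq),
    minFreq  = min(freq),
    .groups  = "drop"
  )
df2
```

When an id can carry several species, first() silently hides the extra labels, so the distinct() or group_by(id, species) variants above are safer in that case.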

How to return values from group_by in R dplyr?

Good morning,
I've got a two-column dataset which I'd like to spread to more columns based on a group_by in Dplyr but I'm not sure how.
My data looks like:
Person Case
John A
John B
Bill C
David F
I'd like to be able to transform it to the following structure:
Person Case_1 Case_2 ... Case_n
John A B
Bill C NA
David F NA
My original thought was along the lines of:
data %>%
group_by(Person) %>%
spread()
Error: Please supply column name
What's the easiest, or most R-like way to achieve this?
You should first add a case id to the dataset, which can be done with a combination of group_by and mutate:
dat = data.frame(Person = c('John', 'John', 'Bill', 'David'), Case = c('A', 'B', 'C', 'F'))
dat = dat %>% group_by(Person) %>% mutate(id = sprintf('Case_%d', row_number()))
dat %>% head()
# A tibble: 4 × 3
Person Case id
<fctr> <fctr> <chr>
1 John A Case_1
2 John B Case_2
3 Bill C Case_1
4 David F Case_1
Now you can use spread to transform the data:
dat %>% spread(Person, Case)
# A tibble: 2 × 4
id Bill David John
* <chr> <fctr> <fctr> <fctr>
1 Case_1 C F A
2 Case_2 NA NA B
You can get the structure you list above using:
res = dat %>% spread(Person, Case) %>% select(-id) %>% t() %>% as.data.frame()
names(res) = unique(dat$id)
res
Case_1 Case_2
Bill C <NA>
David F <NA>
John A B
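With a current tidyr, the transpose step can be avoided by spreading on the case id instead of on Person. A minimal sketch using pivot_wider(), the successor of spread(); it assumes the same Case_n numbering as above, and res is an illustrative name.

```r
library(dplyr)
library(tidyr)

dat <- data.frame(
  Person = c("John", "John", "Bill", "David"),
  Case   = c("A", "B", "C", "F"),
  stringsAsFactors = FALSE
)

res <- dat %>%
  group_by(Person) %>%
  mutate(id = paste0("Case_", row_number())) %>%  # Case_1, Case_2, ... per person
  ungroup() %>%
  pivot_wider(names_from = id, values_from = Case)
res
```

This keeps Person as a regular column and fills the missing cases with NA.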

Resources