I am a beginner in R and I have the sample data below. I am trying to extract all the entries which have two "d7" values and one "d1" value for the same Idvalue_number.
Sample name Idvalue_number
a d1 1
f d7 1
b d7 1
s d1 5
g d7 5
r d7 5
z d1 7
y d7 7
d d1 7
Expected output
a d1 1
f d7 1
b d7 1
s d1 5
g d7 5
r d7 5
Here is some code I have tried, which is not giving me the desired output:
d1d7 <- data_ %>%
  group_by(Idvalue_number) %>%
  filter(n() >= 3 & any(name == first(name)))
Could someone help me here? Thanks in advance.
One way that you can do it is shown below.
library(tidyverse)
# create an example data frame (matching the sample data in the question)
df <- data.frame(Sample = c("a", "f", "b", "s", "g", "r", "z", "y", "d"),
                 name = c("d1", "d7", "d7", "d1", "d7", "d7", "d1", "d7", "d1"),
                 Idvalue_number = c(1, 1, 1, 5, 5, 5, 7, 7, 7))

df %>%
  group_by(Idvalue_number, name) %>%
  summarise(total = n()) %>%
  filter((name == "d1" & total == 1) | (name == "d7" & total == 2))
Idvalue_number name total
<dbl> <fct> <int>
1 1 d1 1
2 1 d7 2
3 5 d1 1
4 5 d7 2
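The code above returns only the per-group counts. If the original rows are wanted, as in the expected output, one option (my own sketch building on the code above, not part of the original answer) is to keep the Idvalue_number groups that satisfy both rules and semi_join() them back onto df:
qualifying <- df %>%
  group_by(Idvalue_number, name) %>%
  summarise(total = n(), .groups = "drop") %>%
  filter((name == "d1" & total == 1) | (name == "d7" & total == 2)) %>%
  count(Idvalue_number) %>%
  filter(n == 2)          # the group passes both the d1 rule and the d7 rule

df %>% semi_join(qualifying, by = "Idvalue_number")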
An option would be to filter based on the frequency of 'd1' and 'd7' within each 'Idvalue_number':
library(dplyr)
data_ %>%
  group_by(Idvalue_number) %>%
  filter(n() >= 3, sum(name == "d1") == 1, sum(name == "d7") == 2)
# A tibble: 6 x 3
# Groups: Idvalue_number [2]
# Sample name Idvalue_number
# <chr> <chr> <int>
#1 a d1 1
#2 f d7 1
#3 b d7 1
#4 s d1 5
#5 g d7 5
#6 r d7 5
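For comparison, the same frequency-based filter can be written in base R with ave() (my addition, not part of the original answer; it uses the data_ object defined under data below):
keep <- with(data_,
             ave(name == "d1", Idvalue_number, FUN = sum) == 1 &
             ave(name == "d7", Idvalue_number, FUN = sum) == 2)
data_[keep, ]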
data
data_ <- structure(list(Sample = c("a", "f", "b", "s", "g", "r", "z",
"y", "d"), name = c("d1", "d7", "d7", "d1", "d7", "d7", "d1",
"d7", "d1"), Idvalue_number = c(1L, 1L, 1L, 5L, 5L, 5L, 7L, 7L,
7L)), class = "data.frame", row.names = c(NA, -9L))
There is a data frame with multiple values in one column.
I want to change the rows and columns of this data frame, like this:
data: (see the dput in the answers below)
result:
What should I do?
Please try this:
library(dplyr)
library(tidyr)

wide_df <- df %>%
  group_by(type) %>%
  mutate(row = row_number()) %>%
  pivot_wider(names_from = type, values_from = value) %>%
  select(-row)
Edit: I need to check the expected output in your case; for example, suppose two rows, 1, D, D2 and 2, A, A1, are added to the input.
Old answer: Actually, you want to distribute every available value of columns A to D across rows for each id. So I first calculate, with a temporary piece of code, the number of rows that have to be generated for each id (tmp). After that I gather the values by pivoting the data wider and replicating each value list the desired number of times.
It goes like this:
#load libraries
library(dplyr)
library(tidyr)
#calculate number of rows to generate
tmp <- df %>%
  group_by(id, type) %>%
  mutate(tmp = n()) %>%
  summarise(tmp = max(tmp)) %>%
  group_by(id) %>%
  summarise(tmp = prod(tmp))
#store this value in variable n
n <- tmp$tmp
#final code
df %>%
  pivot_wider(names_from = type, values_from = value,
              values_fn = function(x) {
                l <- list(x)
                list(rep(l[[1]], n / length(l[[1]])))
              }) %>%
  unnest(-id)
# A tibble: 6 x 5
id A B C D
<int> <chr> <chr> <chr> <chr>
1 1 A1 B1 C1 D1
2 1 A1 B2 C2 D1
3 1 A1 B1 C3 D1
4 1 A1 B2 C1 D1
5 1 A1 B1 C2 D1
6 1 A1 B2 C3 D1
dput used
df <- structure(list(id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L), type = c("A",
"B", "B", "C", "C", "C", "D"), value = c("A1", "B1", "B2", "C1",
"C2", "C3", "D1")), class = "data.frame", row.names = c(NA, -7L
))
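As a quick sanity check of the row count used above (my addition): for this single-id example the per-type counts are A = 1, B = 2, C = 3, D = 1, so 1 * 2 * 3 * 1 = 6 rows are generated for id 1, matching tmp$tmp.
prod(table(df$type))
#> [1] 6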
I think you can do this with unstack and expand.grid, provided the order of rows does not matter and id is not needed:
expand.grid(unstack(x[3:2]))
# A B C D
#1 A1 B1 C1 D1
#2 A1 B2 C1 D1
#3 A1 B1 C2 D1
#4 A1 B2 C2 D1
#5 A1 B1 C3 D1
#6 A1 B2 C3 D1
Data:
x <- data.frame(id = 1, type = c("A", "B", "B", "C", "C", "C", "D")
, value = c("A1", "B1", "B2", "C1", "C2", "C3", "D1"))
Sadly, the posted answers didn't give me the right results, so I got the results my own way, but I don't know whether this is an efficient approach.
library(dplyr)
library(tidyr)
df <- data.frame(id = 1, type = c("A", "B", "B", "C", "C", "C", "D"),
                 val = c("A1", "B1", "B2", "C1", "C2", "C3", "D1"))
tmp <- aggregate(val ~ id + type, df, toString)   # e.g. C -> "C1, C2, C3"
result_df <- tmp %>%
  spread(key = "type", value = "val") %>%
  separate_rows(A, sep = ", ") %>% separate_rows(B, sep = ", ") %>%
  separate_rows(C, sep = ", ") %>% separate_rows(D, sep = ", ")
result:
Anyway, thank you all so much!
I need your kind help tidying data using R.
My original data looks like this:
> dput(mydata)
structure(list(subject = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("N1", "E1"), class = "factor"), item_number = c(1,
2, 1, 7, 1, 2, 2, 10), block = c(1, 1, 3, 3, 1, 1, 3, 3), condition = c("L",
"L", "EI", "I", "L", "L", "EI", "I")), row.names = c(NA, 8L), class = "data.frame")
> mydata
subject item_number block condition
1 N1 1 1 L
2 N1 2 1 L
3 N1 1 3 EI
4 N1 7 3 I
5 E1 1 1 L
6 E1 2 1 L
7 E1 2 3 EI
8 E1 10 3 I
Due to a programming error, I could not label the conditions in block 1 correctly. So I am trying to fix that by renaming the condition in block 1, per subject and per item number. Ideally, any item_number in block 1 that has the value L for condition should be relabelled with the condition given to the same item_number in block 3. For example, for subject N1, if item_number 1 exists in block 3 and is labelled EI there, then the condition for item_number 1 in block 1 should be set to that same label, EI. If item_number 2 does not exist in block 3 for subject N1, then the condition for item_number 2 in block 1 should be 'E'.
The desired output should look like this:
dput(mydata_cleaned)
structure(list(subject = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("N1", "E1"), class = "factor"), item_number = c(1,
2, 1, 7, 1, 2, 2, 10), block = c(1, 1, 3, 3, 1, 1, 3, 3), condition = c("EI",
"E", "EI", "I", "E", "EI", "EI", "I")), row.names = c(NA, 8L), class = "data.frame")
> mydata_cleaned
subject item_number block condition
1 N1 1 1 EI
2 N1 2 1 E
3 N1 1 3 EI
4 N1 7 3 I
5 E1 1 1 E
6 E1 2 1 EI
7 E1 2 3 EI
8 E1 10 3 I
Any help is greatly appreciated.
An option is to reshape to 'wide' format with column names created from 'block', then replace the values in column '1' based on the values of column '3', and reshape back to 'long' format:
library(dplyr)
library(tidyr)
mydata %>%
  pivot_wider(names_from = block, values_from = condition) %>%
  mutate(`1` = case_when(`3` %in% "EI" & `1` %in% "L" ~ `3`,
                         is.na(`3`) ~ 'E',
                         TRUE ~ `1`)) %>%
  pivot_longer(cols = c(`1`, `3`), names_to = 'block',
               values_to = 'condition', values_drop_na = TRUE)
output
# A tibble: 8 x 4
# subject item_number block condition
# <fct> <dbl> <chr> <chr>
#1 N1 1 1 EI
#2 N1 1 3 EI
#3 N1 2 1 E
#4 N1 7 3 I
#5 E1 1 1 E
#6 E1 2 1 EI
#7 E1 2 3 EI
#8 E1 10 3 I
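An alternative sketch (my own, not taken from the answer above): pull out the block-3 labels and join them back by subject and item_number, defaulting to 'E' when an item has no block-3 match. This assumes each subject/item_number pair occurs at most once in block 3; otherwise the join would duplicate rows.
library(dplyr)

block3 <- mydata %>%
  filter(block == 3) %>%
  select(subject, item_number, condition_b3 = condition) %>%
  distinct()

mydata %>%
  left_join(block3, by = c("subject", "item_number")) %>%
  mutate(condition = if_else(block == 1 & condition == "L",
                             coalesce(condition_b3, "E"),
                             condition)) %>%
  select(-condition_b3)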
I have a data frame named df which looks like.
x y
A NA
B d1
L d2
F c1
L s2
A c4
B NA
B NA
A c1
F a5
G NA
H NA
I want to group by x and fill in NA values with the first non-NA element in that group if possible. Note that some groups will not have a non-NA element so returning NA is fine for that case.
df %>% group_by(x) %>% mutate(new_y = first(y))
returns the first value even if it is NA, despite non-NA values existing in that group.
We can use replace
df %>%
  group_by(x) %>%
  mutate(y = replace(y, is.na(y), y[!is.na(y)][1]))
# x y
# <chr> <chr>
#1 A c4
#2 B d1
#3 L d2
#4 F c1
#5 L s2
#6 A c4
#7 B d1
#8 B d1
#9 A c1
#10 F a5
#11 G <NA>
#12 H <NA>
Or we can do a join in data.table
library(data.table)
library(dplyr) # for coalesce()
setDT(df)[df[order(x, is.na(y)), .SD[1L], x], y := coalesce(y, i.y), on = .(x)]
df
# x y
# 1: A c4
# 2: B d1
# 3: L d2
# 4: F c1
# 5: L s2
# 6: A c4
# 7: B d1
# 8: B d1
# 9: A c1
#10: F a5
#11: G NA
#12: H NA
Or using base R
df$y <- with(df, ave(y, x, FUN = function(x) replace(x, is.na(x), x[!is.na(x)][1])))
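Another dplyr variant (my own sketch, equivalent to the replace() approach above) is coalesce() with the first non-NA value of each group:
library(dplyr)
df %>%
  group_by(x) %>%
  mutate(y = coalesce(y, y[!is.na(y)][1])) %>%
  ungroup()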
data
df <- structure(list(x = c("A", "B", "L", "F", "L", "A", "B", "B",
"A", "F", "G", "H"), y = c(NA, "d1", "d2", "c1", "s2", "c4",
NA, NA, "c1", "a5", NA, NA)), .Names = c("x", "y"), class = "data.frame",
row.names = c(NA, -12L))
Starting with the Dataframe y:
x <- c(2,NA,6,8,9,10)
y <- data.frame(letters[1:6], 1:6, NA, 3:8, NA, x, NA)
colnames(y) <- c("Patient", "C1", "First_C1", "C2", "First_C2", "C3", "First_C3")
I want R to look at each element of C1, find the first patient (first row) that has that element, identify which column it is in, and put the "coordinates" in the form "Patient_Column" into First_C1. Then do the same with C2 and C3.
So, the result should be this:
y$First_C1 <- c("a_C1", "a_C3", "a_C2", "b_C2", "c_C2", "c_C3")
y$First_C2 <- c("a_C2", "b_C2", "c_C2", "c_C3", "e_C2", "d_C3")
y$First_C3 <- c("a_C3", NA, "c_C3", "d_C3", "e_C3", "f_C3")
I don't know how to write the code, or even how to search for it. Could someone help me here?
We start from the y without the output columns:
y<-structure(list(Patient = structure(1:6, .Label = c("a", "b",
"c", "d", "e", "f"), class = "factor"), C1 = 1:6, C2 = 3:8, C3 = c(2,
NA, 6, 8, 9, 10)), .Names = c("Patient", "C1", "C2", "C3"), row.names = c(NA,
-6L), class = "data.frame")
Then, we can try:
y[paste0("First_C",1:3)]<-lapply(y[,2:4],
function(x) {
d<-arrayInd(match(x,t(y[,2:4])),dim(t(y[,2:4])))[,2:1]
paste(y$Patient[d[,1]],colnames(y[,2:4])[d[,2]],sep="_")
})
y[,5:7][is.na(y[,2:4])]<-NA
# Patient C1 C2 C3 First_C1 First_C2 First_C3
#1 a 1 3 2 a_C1 a_C2 a_C3
#2 b 2 4 NA a_C3 b_C2 <NA>
#3 c 3 5 6 a_C2 c_C2 c_C3
#4 d 4 6 8 b_C2 c_C3 d_C3
#5 e 5 7 9 c_C2 e_C2 e_C3
#6 f 6 8 10 c_C3 d_C3 f_C3
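For reference, an equivalent base R sketch (my own, not part of the answer above) that builds the row-major value and label vectors once instead of transposing inside every call:
vals <- as.vector(t(y[, 2:4]))                  # search order: a's C1, C2, C3, then b's C1, ...
labels <- paste(rep(y$Patient, each = 3),
                rep(colnames(y)[2:4], times = nrow(y)), sep = "_")
y[paste0("First_C", 1:3)] <- lapply(y[, 2:4], function(col) {
  out <- labels[match(col, vals)]
  out[is.na(col)] <- NA                         # keep NA inputs as NA
  out
})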
I have two data.frames: one is a look-up table that tells me the set of products included in each group. Each group has at least one product of Type 1 and one of Type 2.
The second data.frame gives details of the transactions. Each transaction can contain one of the following combinations of products:
a) Only products of Type 1 from one of the groups
b) Only products of Type 2 from one of the groups
c) Product of Type 1 and Type 2 from the same group
For my analysis, I am interested in c) above, i.e. how many transactions had products of both Type 1 and Type 2 (from the same group) sold. We will ignore a transaction altogether if a product of Type 1 and a product of Type 2 from different groups are sold in the same transaction.
Thus, each product of Type 1 or Type 2 MUST belong to the same group.
Here's my look up table:
> P_Lookup
Group ProductID1 ProductID2
Group1 A 1
Group1 B 2
Group1 B 3
Group2 C 4
Group2 C 5
Group2 C 6
Group3 D 7
Group3 C 8
Group3 C 9
Group4 E 10
Group4 F 11
Group4 G 12
Group5 H 13
Group5 H 14
Group5 H 15
For instance, I won't have Product G and Product 15 in one transaction because they belong to different groups.
Here are the transactions:
TransactionID ProductID ProductType
a1 A 1
a1 B 1
a1 1 2
a2 C 1
a2 4 2
a2 5 2
a3 D 1
a3 C 1
a3 7 2
a3 8 2
a4 H 1
a5 1 2
a5 2 2
a5 3 2
a5 3 2
a5 1 2
a6 H 1
a6 15 2
My Code:
Now, I was able to write code using dplyr for shortlisting transactions from one group. However, I am not sure how I can vectorize my code for all groups.
Here's my code:
P_Groups <- unique(P_Lookup$Group)
Chosen_Group <- P_Groups[5]
P_Group_Ind <- P_Trans %>%
  group_by(TransactionID) %>%
  dplyr::filter(ProductID %in% unique(P_Lookup[P_Lookup$Group == Chosen_Group, ]$ProductID1) |
                ProductID %in% unique(P_Lookup[P_Lookup$Group == Chosen_Group, ]$ProductID2)) %>%
  mutate(No_of_PIDs = n_distinct(ProductType)) %>%
  mutate(Group_Name = Chosen_Group)
P_Group_Ind <- P_Group_Ind[P_Group_Ind$No_of_PIDs > 1, ]
This works well as long as I manually select each group, i.e. by setting Chosen_Group. However, I am not sure how I can automate this. One way I am thinking of is to use a for loop, but I know that the beauty of R is vectorization, so I want to stay away from using a for loop.
I'd sincerely appreciate any help; I have spent almost two days on this. I looked at "using dplyr in for loop in r", but it seems that thread is talking about a different issue.
DATA:
Here's dput for P_Trans:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D",
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1,
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID",
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")
Here's dput for P_Lookup:
structure(list(Group = c("Group1", "Group1", "Group1", "Group2",
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4",
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A",
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H",
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA,
15L), class = "data.frame")
Here's the dput() after adding a product to P_Trans that doesn't exist in the look-up table:
structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2",
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5",
"a6", "a6", "a7"), ProductID = c("A", "B", "1", "C", "4", "5",
"D", "C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15",
"22"), ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2,
2, 2, 2, 1, 2, 3)), .Names = c("TransactionID", "ProductID",
"ProductType"), row.names = c(NA, 19L), class = "data.frame")
Below is a tidyverse (dplyr, tidyr, and purrr) solution that I hope will help.
Note that the use of map_df in the last line returns all results as a data frame. If you'd prefer it to be a list object for each group, then simply use map.
library(dplyr)
library(tidyr)
library(purrr)
# Save unique groups for later use
P_Groups <- unique(P_Lookup$Group)
# Convert lookup table to product IDs and Groups
P_Lookup <- P_Lookup %>%
  gather(ProductIDn, ProductID, ProductID1, ProductID2) %>%
  select(ProductID, Group) %>%
  distinct() %>%
  nest(-ProductID, .key = Group)
# Bind Group information to transactions
# and group for next analysis
P_Trans <- P_Trans %>%
  left_join(P_Lookup) %>%
  filter(!map_lgl(Group, is.null)) %>%
  unnest(Group) %>%
  group_by(TransactionID)
# Iterate through Groups to produce results
map(P_Groups, ~ filter(P_Trans, Group == .)) %>%
  map(~ mutate(., No_of_PIDs = n_distinct(ProductType))) %>%
  map_df(~ filter(., No_of_PIDs > 1))
#> Source: local data frame [12 x 5]
#> Groups: TransactionID [4]
#>
#> TransactionID ProductID ProductType Group No_of_PIDs
#> <chr> <chr> <dbl> <chr> <int>
#> 1 a1 A 1 Group1 2
#> 2 a1 B 1 Group1 2
#> 3 a1 1 2 Group1 2
#> 4 a2 C 1 Group2 2
#> 5 a2 4 2 Group2 2
#> 6 a2 5 2 Group2 2
#> 7 a3 D 1 Group3 2
#> 8 a3 C 1 Group3 2
#> 9 a3 7 2 Group3 2
#> 10 a3 8 2 Group3 2
#> 11 a6 H 1 Group5 2
#> 12 a6 15 2 Group5 2
Here is a single pipe dplyr solution:
P_DualGroupTransactionsCount <-
  P_Lookup %>%                                        # data needing a single-column map of keys
  gather(IDnum, ProductID, ProductID1:ProductID2) %>% # long, single map of keys to Group (tidyr::gather)
  right_join(P_Trans) %>%                             # join the transactions to the group info
  group_by(TransactionID, Group) %>%                  # organize by same transaction & same group
  mutate(DualGroup = ifelse(n_distinct(ProductType) == 2, T, F)) %>% # flag transaction/group pairs containing both product types
  filter(DualGroup == T) %>%                          # choose only the doubles
  select(TransactionID, Group) %>%                    # remove excess columns
  distinct %>%                                        # remove excess rows
  nrow                                                # count of unique transaction/group pairs
# (drop the final nrow step to keep the four rows printed below as P_DualGroupTransactions)
# P_DualGroupTransactions
# Source: local data frame [4 x 2]
# Groups: TransactionID, Group [4]
#
# TransactionID Group
# <chr> <chr>
# 1 a1 Group1
# 2 a2 Group2
# 3 a3 Group3
# 4 a6 Group5
# P_DualGroupTransactionsCount
[1] 4
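For completeness, the per-group code from the question can also be run over every group directly (my sketch of that idea, not taken from the answers above): wrap it in lapply() over unique(P_Lookup$Group) and bind the results. For the example data this returns the same 12 transaction rows as the purrr-based answer.
library(dplyr)

result <- lapply(unique(P_Lookup$Group), function(g) {
  # all product IDs (Type 1 and Type 2) belonging to group g
  ids <- unique(c(P_Lookup$ProductID1[P_Lookup$Group == g],
                  P_Lookup$ProductID2[P_Lookup$Group == g]))
  P_Trans %>%
    group_by(TransactionID) %>%
    filter(ProductID %in% ids) %>%
    mutate(No_of_PIDs = n_distinct(ProductType), Group_Name = g) %>%
    filter(No_of_PIDs > 1)
}) %>%
  bind_rows()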