I need your kind help tidying data using R.
My original data looks like this:
> dput(mydata)
structure(list(subject = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("N1", "E1"), class = "factor"), item_number = c(1,
2, 1, 7, 1, 2, 2, 10), block = c(1, 1, 3, 3, 1, 1, 3, 3), condition = c("L",
"L", "EI", "I", "L", "L", "EI", "I")), row.names = c(NA, 8L), class = "data.frame")
> mydata
subject item_number block condition
1 N1 1 1 L
2 N1 2 1 L
3 N1 1 3 EI
4 N1 7 3 I
5 E1 1 1 L
6 E1 2 1 L
7 E1 2 3 EI
8 E1 10 3 I
Due to a programming error, I could not label the conditions in block 1 correctly. I am trying to fix that by renaming the condition in block 1 for each subject and item number. Any item_number in block 1 whose condition is "L" should be relabelled with the condition given to the same item_number in block 3. For example, for subject N1, item_number 1 exists in block 3 with the condition "EI", so the condition for item_number 1 in block 1 should also be "EI". Item_number 2 does not exist in block 3 for subject N1, so the condition for item_number 2 in block 1 should be "E".
The desired output should look like this:
dput(mydata_cleaned)
structure(list(subject = structure(c(1L, 1L, 1L, 1L, 2L, 2L,
2L, 2L), .Label = c("N1", "E1"), class = "factor"), item_number = c(1,
2, 1, 7, 1, 2, 2, 10), block = c(1, 1, 3, 3, 1, 1, 3, 3), condition = c("EI",
"E", "EI", "I", "E", "EI", "EI", "I")), row.names = c(NA, 8L), class = "data.frame")
> mydata_cleaned
subject item_number block condition
1 N1 1 1 EI
2 N1 2 1 E
3 N1 1 3 EI
4 N1 7 3 I
5 E1 1 1 E
6 E1 2 1 EI
7 E1 2 3 EI
8 E1 10 3 I
Any help is greatly appreciated.
One option is to reshape to 'wide' format with column names created from 'block', replace the values in column `1` based on the values in column `3`, and then reshape back to 'long' format.
library(dplyr)
library(tidyr)
mydata %>%
  # spread 'condition' into one column per block ('1' and '3')
  pivot_wider(names_from = block, values_from = condition) %>%
  # where block 3 is 'EI' and block 1 is 'L', copy the block 3 label;
  # where there is no block 3 row, use 'E'; otherwise keep block 1 as is
  mutate(`1` = case_when(`3` %in% "EI" & `1` %in% "L" ~ `3`,
                         is.na(`3`) ~ 'E',
                         TRUE ~ `1`)) %>%
  # gather back to 'long' format, dropping the padded NA rows
  pivot_longer(cols = c(`1`, `3`), names_to = 'block',
               values_to = 'condition', values_drop_na = TRUE)
Output:
# A tibble: 8 x 4
# subject item_number block condition
# <fct> <dbl> <chr> <chr>
#1 N1 1 1 EI
#2 N1 1 3 EI
#3 N1 2 1 E
#4 N1 7 3 I
#5 E1 1 1 E
#6 E1 2 1 EI
#7 E1 2 3 EI
#8 E1 10 3 I
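A join-based alternative that follows the stated rule directly (a minimal sketch, assuming dplyr is loaded; 'lookup' and 'condition3' are just illustrative names): build a table of the block-3 labels per subject and item_number, join it back on, and fall back to 'E' where no block-3 row exists.
library(dplyr)
# block-3 label for every subject/item_number combination
lookup <- mydata %>%
  filter(block == 3) %>%
  select(subject, item_number, condition3 = condition)
mydata %>%
  left_join(lookup, by = c("subject", "item_number")) %>%
  # relabel only the block-1 rows that were marked 'L'
  mutate(condition = if_else(block == 1 & condition == "L",
                             coalesce(condition3, "E"),
                             condition)) %>%
  select(-condition3)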
I have a dataframe in R that has a large number of bank_account_IDs and Vendor_Codes. Bank_account_IDs should not be shared between Vendor_Codes, but sometimes a fraudulent vendor exists that shares another vendor's bank_account_ID.
I want to add a new field to the dataframe that provides a count for the number of times an account_ID exists with more than 1 Vendor_Code.
My sample dataframe is as follows:
bank_account_ID <- c("a", "b", "c", "a", "a", "d", "e", "f", "b", "c")
Vendor_Code <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
df <- data.frame(bank_account_ID, Vendor_Code)
My ideal new dataframe should look something like this:
bank_account_ID Vendor_Code duplicate_count
a 1 2
b 2 1
c 3 1
a 4 2
a 5 2
d 6 0
e 7 0
f 8 0
b 9 1
c 10 1
Thanks in advance!
We can get the number of distinct Vendor_Code values with n_distinct, grouped by 'bank_account_ID', and subtract 1.
library(dplyr)
df %>%
  group_by(bank_account_ID) %>%
  mutate(dupe_count = n_distinct(Vendor_Code) - 1) %>%
  ungroup()
Output:
# A tibble: 10 x 4
# bank_account_ID Vendor_Code duplicate_count dupe_count
# <chr> <int> <int> <dbl>
# 1 a 1 2 2
# 2 b 2 1 1
# 3 c 3 1 1
# 4 a 4 2 2
# 5 a 5 2 2
# 6 d 6 0 0
# 7 e 7 0 0
# 8 f 8 0 0
# 9 b 9 1 1
#10 c 10 1 1
data
df <- structure(list(bank_account_ID = c("a", "b", "c", "a", "a", "d",
"e", "f", "b", "c"), Vendor_Code = 1:10, duplicate_count = c(2L,
1L, 1L, 2L, 2L, 0L, 0L, 0L, 1L, 1L)), class = "data.frame", row.names = c(NA,
-10L))
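The same idea in base R, as a sketch with no packages (assuming the df shown under 'data'): count the distinct Vendor_Code values per bank_account_ID with ave and subtract 1.
# number of distinct Vendor_Code values per account, minus 1
df$dupe_count <- ave(df$Vendor_Code, df$bank_account_ID,
                     FUN = function(x) length(unique(x)) - 1)
df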
Having the following table:
read.table(text = "route origin dest seq
1 a b 1
1 b c 2
1 c d 3
1 d e 4
2 f g 1
2 g h 2
2 h i 3", header = TRUE)
I'm trying to find a way to go through each row, grouped by route, and iterate over every potential combination of origin-destination pairs, taking into account the seq variable and the route as mentioned.
The output should look something like this:
origin dest
a b
a c
a d
a e
b c
b d
(...) (...)
The idea is that a train on, for example, route 1 goes from a to e, and I want to list every possible pair of stops along that route. I tried igraph but was unsuccessful.
Any ideas with dplyr or similar?
library(dplyr)
library(tidyr)
df %>%
  mutate_if(is.factor, as.character) %>%  # convert factor variables to character
  group_by(route) %>%
  # all possible combinations of origin & destination within each route,
  # with the seq value appended so the pairs can be filtered afterwards
  expand(origin = paste(origin, seq, sep = "_"),
         dest = paste(dest, seq, sep = "_")) %>%
  rowwise() %>%
  # keep pairs with different stops where the origin does not come after the destination
  filter(strsplit(origin, split = "_")[[1]][1] != strsplit(dest, split = "_")[[1]][1] &
           strsplit(origin, split = "_")[[1]][2] <= strsplit(dest, split = "_")[[1]][2]) %>%
  # strip the appended seq values again
  mutate(origin = gsub("_.*$", "", origin),
         dest = gsub("_.*$", "", dest))
Output is:
route origin dest
1 1 a b
2 1 a c
3 1 a d
4 1 a e
5 1 b c
...
Sample data:
df <- structure(list(route = c(1L, 1L, 1L, 1L, 2L, 2L, 2L), origin = structure(1:7, .Label = c("a",
"b", "c", "d", "f", "g", "h"), class = "factor"), dest = structure(1:7, .Label = c("b",
"c", "d", "e", "g", "h", "i"), class = "factor"), seq = c(1L,
2L, 3L, 4L, 1L, 2L, 3L)), class = "data.frame", row.names = c(NA,
-7L))
# route origin dest seq
#1 1 a b 1
#2 1 b c 2
#3 1 c d 3
#4 1 d e 4
#5 2 f g 1
#6 2 g h 2
#7 2 h i 3
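An alternative sketch using a self-join on 'route' (assuming dplyr; the '_to' suffix is just an illustrative naming choice): pair every stop with every stop on the same route whose seq is not earlier, and take that row's dest as the destination.
library(dplyr)
# recent dplyr versions may warn about the many-to-many join; that is expected here
df %>%
  inner_join(df, by = "route", suffix = c("", "_to")) %>%
  filter(seq <= seq_to) %>%
  transmute(route, origin, dest = dest_to)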
I have a data.frame that looks like the following:
OBJECT ID TASK
1 A
1 C
1 D
1 E
2 A
2 B
2 C
2 D
2 F
Now I would like to count the unique successive combinations within the data.frame in order to get the following result:
PREDECESSOR SUCCESSOR COUNT
A C 1
C D 2
D E 1
A B 1
B C 1
D F 1
I've already figured out how to extract the successive values with the help of two for loops, but I'm failing at the assignment and counting step within a new data.frame (or list).
aggregate(COUNT ~ .,
          data.frame(PREDECESSOR = head(df1$TASK, -1),
                     SUCCESSOR = tail(df1$TASK, -1),
                     COUNT = 1),
          length)
# PREDECESSOR SUCCESSOR COUNT
#1 E A 1
#2 A B 1
#3 A C 1
#4 B C 1
#5 C D 2
#6 D E 1
#7 D F 1
You could use a similar approach if you first want to split by OBJECT.ID so that pairs do not cross objects:
temp <- do.call(rbind, lapply(split(df1, df1$OBJECT.ID), function(X) {
  aggregate(COUNT ~ .,
            data.frame(PREDECESSOR = head(X$TASK, -1),
                       SUCCESSOR = tail(X$TASK, -1),
                       COUNT = 1),
            length)
}))
aggregate(COUNT~., temp, length)
# PREDECESSOR SUCCESSOR COUNT
#1 A C 1
#2 B C 1
#3 C D 2
#4 D E 1
#5 A B 1
#6 D F 1
DATA
df1 = structure(list(OBJECT.ID = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L), TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F")), .Names = c("OBJECT.ID",
"TASK"), class = "data.frame", row.names = c(NA, -9L))
Solution using data.table:
Code:
library(data.table)
setDT(df)
df[, TASK0 := shift(TASK), OBJECT]
df[!is.na(TASK0), .N, .(TASK, TASK0)][, .(
COUNT = sum(N)), .(PREDECESSOR = TASK0, SUCCESSOR = TASK)]
Result:
PREDECESSOR SUCCESSOR COUNT
1: A C 1
2: C D 2
3: D E 1
4: A B 1
5: B C 1
6: D F 1
Explanation:
setDT(df): turns the data.frame into a data.table object
[, TASK0 := shift(TASK), OBJECT]: gets the previous TASK within each OBJECT
!is.na(TASK0): drops the first row of each OBJECT (it has no PREDECESSOR)
.N, .(TASK, TASK0): counts occurrences of each TASK/TASK0 combination
sum(N): sums the counts
Data (df):
structure(list(OBJECT = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L),
TASK = c("A", "C", "D", "E", "A", "B", "C", "D", "F")), .Names = c("OBJECT",
"TASK"), row.names = c(NA, -9L), class = c("data.table", "data.frame"
))
Just to get the counts, you can do it with the following two lines:
cc <- cbind(df$TASK, c(df$TASK[-1], "LAST"))
table(paste(cc[, 1], cc[, 2], sep = "-"))
The result is
A-B A-C B-C C-D D-E D-F E-A F-LAST
1 1 1 2 1 1 1 1
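A dplyr sketch of the same counting that keeps pairs from crossing OBJECT boundaries (an alternative under the assumption that dplyr is loaded, using the df shown under 'Data (df)'):
library(dplyr)
df %>%
  group_by(OBJECT) %>%
  mutate(PREDECESSOR = lag(TASK), SUCCESSOR = TASK) %>%
  ungroup() %>%
  filter(!is.na(PREDECESSOR)) %>%   # drop each OBJECT's first row
  count(PREDECESSOR, SUCCESSOR, name = "COUNT")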
My data frame looks like below:
df <- data.frame(alphabets1 = c("A","B","C","B","C"," ","NA"),
                 alphabets2 = c("B","A","D","D"," ","E","NA"),
                 alphabets3 = c("C","F","G"," "," "," ","NA"),
                 number = c("1","2","3","1","4","1","2"))
alphabets1 alphabets2 alphabets3 number
1 A B C 1
2 B A F 2
3 C D G 3
4 B D 1
5 C 4
6 E 1
7 NA NA NA 2
NOTE1: within a row all the values are unique; that is, the following is not possible.
alphabets1 alphabets2 alphabets3 number
1 A A C 1
NOTE2: the data frame may contain "NA" or blank values.
I am struggling to get the output below, which is simply a data frame with each alphabet and the sum of its corresponding numbers. For example, A is in rows 1 and 2, so its sum is 1 + 2 = 3; B is in rows 1, 2 and 4, so its sum is 1 + 2 + 1 = 4.
output <-data.frame(alphabets1=c("A","B","C","D","E","F","G"), number = c("3","4","8","4","1","2","3"))
output
alphabets number
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
NOTE3: the output may or may not include the NA or blank values (it doesn't matter!)
We can reshape it to 'long' format and do a group by operation
library(data.table)
melt(setDT(df), id.var = "number", na.rm = TRUE, value.name = "alphabets1")[
  !grepl("^\\s*$", alphabets1),
  .(number = sum(as.integer(as.character(number)))),
  alphabets1]
# alphabets1 number
#1: A 3
#2: B 4
#3: C 8
#4: D 4
#5: E 1
#6: F 2
#7: G 3
Or we can use xtabs from base R
xtabs(number ~ alphabets1,
      data.frame(alphabets1 = unlist(df[-4]),
                 number = as.numeric(as.character(df[, 4]))))
NOTE: In the OP's dataset the missing values are the string "NA" rather than real NA, and the 'number' column is a factor (hence the conversion to integer before doing the sum).
data
df <- data.frame(alphabets1=c("A","B","C","B","C"," ",NA),
alphabets2=c("B","A","D","D"," ","E",NA),
alphabets3=c("C","F","G"," "," "," ",NA),
number = c("1","2","3","1","4","1","2"))
Here is a base R method using sapply and table. I first converted df$number into a numeric. See data section below.
data.frame(table(sapply(df[-length(df)], function(i) rep(i, df$number))))
Var1 Freq
1 11
2 A 3
3 B 4
4 C 8
5 D 4
6 E 1
7 F 2
8 G 3
9 NA 6
To make the output a little bit nicer, we could wrap a few more functions and perform a subsetting within sapply.
data.frame(table(droplevels(unlist(sapply(df[-length(df)],
function(i) rep(i[i %in% LETTERS],
df$number[i %in% LETTERS])),
use.names=FALSE))))
Var1 Freq
1 A 3
2 B 4
3 C 8
4 D 4
5 E 1
6 F 2
7 G 3
It may be easier to do this afterward, though.
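For instance, the after-the-fact clean-up could just subset the first result (a sketch; 'res' is an illustrative name):
res <- data.frame(table(sapply(df[-length(df)], function(i) rep(i, df$number))))
# keep only the single capital letters, dropping the blank and "NA" rows
res[res$Var1 %in% LETTERS, ]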
data
I ran
df$number <- as.numeric(df$number)
on the OP's data resulting in this.
df <-
structure(list(alphabets1 = structure(c(2L, 3L, 4L, 3L, 4L, 1L,
5L), .Label = c(" ", "A", "B", "C", "NA"), class = "factor"),
alphabets2 = structure(c(3L, 2L, 4L, 4L, 1L, 5L, 6L), .Label = c(" ",
"A", "B", "D", "E", "NA"), class = "factor"), alphabets3 = structure(c(2L,
3L, 4L, 1L, 1L, 1L, 5L), .Label = c(" ", "C", "F", "G", "NA"
), class = "factor"), number = c(1, 2, 3, 1, 4, 1, 2)), .Names = c("alphabets1",
"alphabets2", "alphabets3", "number"), row.names = c(NA, -7L), class = "data.frame")
I have a data.frame where I'd like to remove entire groups if any of their members meets a condition.
In this first example, if the values are numbers and the condition is NA the code below works.
df <- structure(list(world = c(1, 2, 3, 3, 2, NA, 1, 2, 3, 2), place = c(1,
1, 2, 2, 3, 3, 1, 2, 3, 1), group = c(1, 1, 1, 2, 2, 2, 3,
3, 3, 3)), .Names = c("world", "place", "group"), row.names = c(NA,
-10L), class = "data.frame")
library(plyr)
ans <- ddply(df, .(group), summarize, code = mean(world))
ans$code[is.na(ans$code)] <- 0
ans2 <- merge(df, ans)
final.ans <- ans2[ans2$code != 0, ]
However, this ddply maneuver with the NA values will not work if the condition is something other than NA, or if the values are non-numeric.
For example, if I wanted to remove groups which have one or more rows with a world value of "AF" (as in the data frame below), this ddply trick would not work.
df2 <-structure(list(world = structure(c(1L, 2L, 3L, 3L, 3L, 5L, 1L,
4L, 2L, 4L), .Label = c("AB", "AC", "AD", "AE", "AF"), class = "factor"),
place = c(1, 1, 2, 2, 3, 3, 1, 2, 3, 1), group = c(1,
1, 1, 2, 2, 2, 3, 3, 3, 3)), .Names = c("world", "place",
"group"), row.names = c(NA, -10L), class = "data.frame")
I can envision a for loop where, for each group, the value of each member is checked; if the condition is met, a code column is populated and a subset is then made based on that code (roughly as in the sketch below).
But perhaps there is a vectorized R way to do this?
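For reference, the loop-and-flag idea might look roughly like this (only a sketch of the non-vectorized approach; 'keep' is a hypothetical helper vector):
keep <- rep(TRUE, nrow(df2))
for (g in unique(df2$group)) {
  rows <- df2$group == g
  # drop the whole group if any of its rows has world == "AF"
  if (any(df2$world[rows] == "AF")) keep[rows] <- FALSE
}
df2[keep, ]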
Using dplyr:
library(dplyr)
df2 %>%
  group_by(group) %>%
  filter(!any(world == "AF"))
Or, as mentioned by @akrun, using data.table:
library(data.table)
setDT(df2)[, if(!any(world == "AF")) .SD, group]
#or
setDT(df2)[, if(all(world != "AF")) .SD, group]
Which gives:
#Source: local data frame [7 x 3]
#Groups: group
#
# world place group
#1 AB 1 1
#2 AC 1 1
#3 AD 2 1
#4 AB 1 3
#5 AE 2 3
#6 AC 3 3
#7 AE 1 3
Alternate data.table solution:
setDT(df2)
df2[!(group %in% df2[world == "AF",group])]
gives:
world place group
1: AB 1 1
2: AC 1 1
3: AD 2 1
4: AB 1 3
5: AE 2 3
6: AC 3 3
7: AE 1 3
Using keys we can be a bit faster:
setkey(df2,group)
df2[!J((df2[world == "AF",group]))]
base package:
df2[ df2$group != df2[ df2$world == 'AF', "group" ], ]
Output:
world place group
1 AB 1 1
2 AC 1 1
3 AD 2 1
7 AB 1 3
8 AE 2 3
9 AC 3 3
10 AE 1 3
Using sqldf:
library(sqldf)
sqldf("SELECT df2.world, df2.place, [group] FROM df2
LEFT JOIN
(SELECT * FROM df2 WHERE world LIKE 'AF') AS t
USING([group])
WHERE t.world IS NULL")
Output:
world place group
1 AB 1 1
2 AC 1 1
3 AD 2 1
4 AB 1 3
5 AE 2 3
6 AC 3 3
7 AE 1 3
Base R option using ave
df2[with(df2, ave(world != "AF", group, FUN = all)),]
# world place group
#1 AB 1 1
#2 AC 1 1
#3 AD 2 1
#7 AB 1 3
#8 AE 2 3
#9 AC 3 3
#10 AE 1 3
Or we can use subset
subset(df2, ave(world != "AF", group, FUN = all))
The above can also be written as
df2[with(df2, !ave(world == "AF", group, FUN = any)),]
and
subset(df2, !ave(world == "AF", group, FUN = any))