DataFrames.transform specifying target variable with anonymous function in Julia

I am trying to use transform with an anonymous function (x -> uppercase.(x)) and store the new column as "A" by specifying a target column name (:A).
If I don't specify a target column variable (first transformation below), the new variable is produced fine (i.e. a Vector with 5 elements). However, once I specify the target column (second transformation below), the function returns a Vector of Pairs under the "a_function" name.
How can I produce the desired DataFrame with a new column "A" containing a Vector with 5 elements ("A" to "E")? Why does the second transformation below return a Vector of Pairs with a name different from the one specified?
using DataFrames
df_1 = DataFrame(a = ["a", "b", "c", "d", "e"])
df_2 = transform(df_1, :a => x -> uppercase.(x)) # first transformation
df_2
Row │ a a_function
│ String String
─────┼────────────────────
1 │ a A
2 │ b B
3 │ c C
4 │ d D
5 │ e E
df_3 = transform(df_1, :a => x -> uppercase.(x) => :A) # second transformation
df_3
5×2 DataFrame
Row │ a a_function
│ String Pair…
─────┼───────────────────────────────────────
1 │ a ["A", "B", "C", "D", "E"]=>:A
2 │ b ["A", "B", "C", "D", "E"]=>:A
3 │ c ["A", "B", "C", "D", "E"]=>:A
4 │ d ["A", "B", "C", "D", "E"]=>:A
5 │ e ["A", "B", "C", "D", "E"]=>:A
Desired outcome DataFrame:
DataFrame(a = ["a", "b", "c", "d", "e"],
A = ["A", "B", "C", "D", "E"])

The reason is operator precedence. If you write:
julia> :a => x -> uppercase.(x) => :A
:a => var"#7#8"()
you see that you have defined only one pair. The whole part uppercase.(x) => :A became the body of your anonymous function.
Instead write (note I added ( and ) around the anonymous function):
julia> :a => (x -> uppercase.(x)) => :A
:a => (var"#9#10"() => :A)
to get what you wanted:
julia> df_3 = transform(df_1, :a => (x -> uppercase.(x)) => :A)
5×2 DataFrame
Row │ a A
│ String String
─────┼────────────────
1 │ a A
2 │ b B
3 │ c C
4 │ d D
5 │ e E
In this case a more standard way to write it would be:
julia> transform(df_1, :a => ByRow(uppercase) => :A)
5×2 DataFrame
Row │ a A
│ String String
─────┼────────────────
1 │ a A
2 │ b B
3 │ c C
4 │ d D
5 │ e E
or even:
julia> transform(df_1, :a => ByRow(uppercase) => uppercase)
5×2 DataFrame
Row │ a A
│ String String
─────┼────────────────
1 │ a A
2 │ b B
3 │ c C
4 │ d D
5 │ e E
The last form is new in DataFrames.jl 1.3, which allows you to pass a function as a destination column name specifier (in this case the transformation was to uppercase the source column name). Of course in this case it is longer, but it is sometimes useful if you define transformations programmatically.

Related

How to shuffle the rows and columns of a DataFrame with a specific seed?

Suppose I have the following DataFrame, and I want to shuffle the rows and columns of the DataFrame with a specific seed value. I tried the following to obtain shuffled indexes, but it gave me a different result every time:
julia> using Random, DataFrames, StatsBase
julia> Random.seed!(123)
julia> df = DataFrame(
col1 = [1, 2, 3],
col2 = [4, 5, 6]
);
julia> idx_row, idx_col = sample.(
[1:size(df, 1), 1:size(df, 2)],
[length(1:size(df, 1)), length(1:size(df, 2))],
replace=false
)
2-element Vector{Vector{Int64}}:
[1, 2, 3]
[2, 1]
julia> idx_row, idx_col = sample.(
[1:size(df, 1), 1:size(df, 2)],
[length(1:size(df, 1)), length(1:size(df, 2))],
replace=false
)
2-element Vector{Vector{Int64}}:
[2, 1, 3]
[2, 1]
As you can see, it's shuffling the values, but it doesn't consider the seed!. How can I shuffle rows and columns of a DataFrame in a reproducible way, like setting a specific seed?
The Random package you already imported provides a function named shuffle that does this. Passing it a seeded random number generator makes the result reproducible:
julia> @which shuffle
Random
julia> idx_row, idx_col = shuffle.(
MersenneTwister(123),
[1:size(df, 1), 1:size(df, 2)]
)
2-element Vector{Vector{Int64}}:
[3, 2, 1]
[2, 1]
julia> df[idx_row, idx_col]
3×2 DataFrame
Row │ col2 col1
│ Int64 Int64
─────┼──────────────
1 │ 6 3
2 │ 5 2
3 │ 4 1
The result is reproducible and won't change after each run, despite being a random process.
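The underlying idea (draw the permutations from a dedicated, explicitly seeded generator instead of relying on global random state) is not specific to Julia. A minimal sketch in Python, used here only for illustration; the helper name shuffled_indices and the list-of-rows table are assumptions, not part of the original answer:

```python
import random

def shuffled_indices(n_rows, n_cols, seed):
    """Return reproducible row and column permutations for a table of the given shape."""
    rng = random.Random(seed)      # dedicated generator, independent of global state
    rows = list(range(n_rows))
    cols = list(range(n_cols))
    rng.shuffle(rows)              # in-place shuffle driven only by rng
    rng.shuffle(cols)
    return rows, cols

# Rows of (col1, col2), mirroring the example data in the question.
table = [[1, 4], [2, 5], [3, 6]]
rows, cols = shuffled_indices(3, 2, seed=123)
shuffled = [[table[r][c] for c in cols] for r in rows]
```

Calling shuffled_indices again with the same seed returns the same permutations, because a fresh generator is created from that seed on every call.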
Additional point
Note that there is a customized dispatch of the shuffle function suitable for shuffling rows of a given DataFrame:
julia> shuffle(MersenneTwister(123), df)
3×2 DataFrame
Row │ col1 col2
│ Int64 Int64
─────┼──────────────
1 │ 3 6
2 │ 1 4
3 │ 2 5
*Note that this only shuffles the rows.
You can choose whatever rng you want, e.g. rng = MersenneTwister(113), and use it to shuffle the range of DataFrame size.
r,c = shuffle.(rng, range.(1,size(df)))
([3, 1, 2], [2, 1])
df[r,c]
3×2 DataFrame
Row │ col2 col1
│ Int64 Int64
─────┼──────────────
1 │ 5 2
2 │ 6 3
3 │ 4 1

How to get the size of sets given a list of pairs?

Let's say I have run different tests to see if some objects are identical. The testing was done pairwise, and I have a dataframe containing the pairs of objects that are the same:
same.pairs <- data.frame(Test=c(rep(1, 4), rep(2, 6)),
First=c("A", "A", "B", "D", "A", "A", "B", "C", "C", "D"),
Second=c("B", "C", "C", "E", "B", "E", "E", "D", "G", "G"))
##
Test First Second
1 A B
1 A C
1 B C
1 D E
2 A B
2 A E
2 B E
2 C D
2 C G
2 D G
From this I can see that in Test 1, because A = B and A = C and B = C, then A = B = C and these 3 objects belong in one set of size 3.
I want to know the full size of the sets for each test. For this example, I want to know that for Test 1, one set is 3 identical objects (A, B, C) and one set is 2 (D, E), and for Test 2, two sets are size 3 ((A, B, E) and (C, D, G)). I don't need to know which objects are in each set, just the size of the sets and the counts of how many sets are that size:
Test ReplicateSize Count
1 3 1
1 2 1
2 3 2
Is there an elegant way to do this? I thought I had it with this:
sets <- same.pairs %>%
group_by(Test, First) %>%
summarize(ReplicateSize=n()) %>%
# add 1 to size because above only counting second genotype, need to include first
mutate(ReplicateSize=ReplicateSize+1) %>%
select(-First) %>%
ungroup() %>%
group_by(Test, ReplicateSize) %>%
summarize(Count=n()) %>%
arrange(Test, ReplicateSize)
##
Test ReplicateSize Count
1 2 2
1 3 1
2 2 2
2 3 2
but this is double counting some of the sets as, for example in Test 1, B&C are counted as a set of size 2 instead of ignored as they are already part of a set with A. I'm not sure how to skip rows where the First object has already been observed as the Second object without making a complicated for loop.
Any guidance appreciated.
I don't fully understand what you are trying to accomplish, but your current code could be shortened to the following:
same.pairs %>%
count(Test, First, name = "ReplicateSize") %>%
count(Test, ReplicateSize, name = "Count") %>%
mutate(ReplicateSize = ReplicateSize + 1)
Test ReplicateSize Count
1 1 2 2
2 1 3 1
3 2 2 2
4 2 3 2
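Note that the shortened code reproduces the double counting described in the question. One way to get the set sizes is to treat every pair as an edge and count the members of each connected group. The following is a minimal union-find sketch in Python, used only for illustration; the function name set_size_counts and the tuple layout are assumptions, and it is not the answerer's method:

```python
from collections import Counter, defaultdict

def set_size_counts(pairs):
    """pairs: iterable of (test, first, second). Returns {test: Counter({size: count})}."""
    parent = {}

    def find(node):
        # Walk up to the representative of the group, shortening the path on the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for test, first, second in pairs:
        a, b = (test, first), (test, second)   # key by test so tests never mix
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b            # merge the two groups

    members = Counter(find(node) for node in list(parent))   # root -> group size
    result = defaultdict(Counter)
    for (test, _), size in members.items():
        result[test][size] += 1                # count groups of each size per test
    return dict(result)

pairs = [
    (1, "A", "B"), (1, "A", "C"), (1, "B", "C"), (1, "D", "E"),
    (2, "A", "B"), (2, "A", "E"), (2, "B", "E"),
    (2, "C", "D"), (2, "C", "G"), (2, "D", "G"),
]
counts = set_size_counts(pairs)
```

For the example data this gives one set of size 3 and one of size 2 for Test 1, and two sets of size 3 for Test 2, which matches the desired table in the question.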

R - row-wise combinations of two lists

Assume that I have two lists of the same length.
l1 <- list(c("a", "b", "c"), "d")
l2 <- list(c("e", "f"), c("g", "h", "i"))
Each row/element of a list can be seen as a specific pair. So in this example the two vectors
c("a", "b", "c")
c("e", "f")
"belong together" and so do the two others.
I need to get all the possible combinations/permutations of those two vectors with the same index.
I know that I can use expand.grid(c("a", "b", "c"), c("e", "f")) for two vectors, but I'm struggling to do this over both lists iteratively. I tried to use mapply(), but couldn't come up with a solution.
The preferred output can be a dataframe or a list containing all possible row-wise combinations. It's not necessary to keep the information of the "source pair". I'm just interested in the combinations.
So, a possible output could look like this:
l1 l2
1 a e
2 b e
3 c e
4 a f
5 b f
6 c f
7 d g
8 d h
9 d i
You can use Map to loop over the list elements and then use rbind:
do.call(rbind, Map(expand.grid, l1, l2))
# Var1 Var2
#1 a e
#2 b e
#3 c e
#4 a f
#5 b f
#6 c f
#7 d g
#8 d h
#9 d i
Map is just mapply with different defaults.
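The pattern is "pair up the list elements, build the Cartesian product of each pair, then stack the results". A minimal sketch of the same pattern in Python, for illustration only; the scalar "d" is written as a one-element list here, which is an assumption of this sketch:

```python
from itertools import product

l1 = [["a", "b", "c"], ["d"]]
l2 = [["e", "f"], ["g", "h", "i"]]

# zip pairs the elements by position (the role Map plays above), product builds
# each pair's combinations, and the single comprehension concatenates them
# (the role of rbind).
combos = [(x, y) for left, right in zip(l1, l2) for x, y in product(left, right)]
```

The row order differs from expand.grid: product varies the second component fastest, while expand.grid varies the first. The set of combinations is the same.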

Removing Only Adjacent Duplicates in Data Frame in R

I have a data frame in R that is supposed to have duplicates. However, there are some duplicates that I would need to remove. In particular, I only want to remove row-adjacent duplicates, but keep the rest. For example, suppose I had the data frame:
df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"),
y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
This results in the following data frame
x y
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 8
B 9
C 10
In this case, I expect there to be repeating "A, B, C, A, B, C, etc.". However, it is only a problem if I see adjacent row duplicates. In my example above, that would be rows 8 and 9 with the duplicate "B" being adjacent to each other.
In my data set, whenever this occurs, the first instance is always a user-error, and the second is always the correct version. In very rare cases, there might be an instance where the duplicates occur 3 (or more) times. However, in every case, I would always want to keep the last occurrence. Thus, following the example from above, I would like the final data set to look like
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 9
C 10
Is there an easy way to do this in R? Thank you in advance for your help!
Edit: 11/19/2014 12:14 PM EST
There was a solution posted by user Akron (spelling?) that has since been deleted. I am not sure why, because it seemed to work for me.
The solution was
df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
It seems to work for me, why did it get deleted? For example, in cases with more than 2 consecutive duplicates:
df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
x y
1 A 1
2 B 2
3 B 3
4 B 4
5 C 5
6 C 6
7 C 7
8 A 8
9 B 9
10 C 10
11 A 11
12 B 12
13 B 13
14 C 14
> df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
> df
x y
1 A 1
4 B 4
7 C 7
8 A 8
9 B 9
10 C 10
11 A 11
13 B 13
14 C 14
This seems to work?
Try
df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
# x y
#1 A 1
#2 B 2
#3 C 3
#4 A 4
#5 B 5
#6 C 6
#7 A 7
#9 B 9
#10 C 10
Explanation
Here, we compare each element with the one that follows it. To do this, drop the first element of the column and compare the result with the column from which the last element has been dropped (so that the two vectors have equal length):
df$x[-1] #first element removed
#[1] B C A B C A B B C
df$x[-nrow(df)]
#[1] A B C A B C A B B #last element `C` removed
df$x[-1]!=df$x[-nrow(df)]
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
In the above, the length is 1 less than nrow(df) because we removed one element. To compensate, we append a TRUE (the last row is always kept) and then use the resulting logical vector to subset the dataset.
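The shifted comparison can also be sketched outside R. The following Python version is only an illustration of the same logic; the helper name keep_last_of_runs is hypothetical:

```python
def keep_last_of_runs(xs):
    """Return indices of elements that differ from their successor, plus the last index."""
    if not xs:
        return []
    # xs[:-1] drops the last element, xs[1:] drops the first, so the two
    # sequences line up element i with element i + 1.
    differs_from_next = [a != b for a, b in zip(xs[:-1], xs[1:])]
    mask = differs_from_next + [True]      # the final element is always kept
    return [i for i, keep in enumerate(mask) if keep]

x = ["A", "B", "C", "A", "B", "C", "A", "B", "B", "C"]
y = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
kept = [(x[i], y[i]) for i in keep_last_of_runs(x)]
```

With the example data, the pair ("B", 8) is dropped and ("B", 9) is kept, as in the R output above.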
Here's an rle solution:
df[cumsum(rle(as.character(df$x))$lengths), ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10
Explanation:
RLE stands for run-length encoding. rle() returns a list of two vectors: values, holding the value of each run, and lengths, holding the number of consecutive repeats of each value. For example, x <- c(3, 2, 2, 3) has values c(3, 2, 3) and lengths c(1, 2, 1). The cumulative sum of the lengths is c(1, 3, 4); subsetting x with this vector gives c(3, 2, 3). Note that the second element of the cumulative sum, 3, indexes the third element of x, which is the last occurrence of 2 in that particular run.
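The run-length idea can be checked with a small sketch in Python, used only for illustration; the helper name run_end_positions is hypothetical, and the positions are 0-based, unlike R's 1-based indexing:

```python
from itertools import accumulate, groupby

def run_end_positions(xs):
    """Return the 0-based position of the last element of every run in xs."""
    # groupby yields one group per run of equal consecutive values.
    lengths = [len(list(group)) for _, group in groupby(xs)]
    # The cumulative sum of run lengths gives 1-based end positions.
    return [end - 1 for end in accumulate(lengths)]

x = [3, 2, 2, 3]
ends = run_end_positions(x)        # run lengths are 1, 2, 1
last_of_each_run = [x[i] for i in ends]
```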
You could also try
df[c(diff(as.numeric(df$x)), 1) != 0, ]
In case x is of character class (rather than factor), try
df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10

map (align) smaller to larger sequence in r

I have the following framework dataset:
master <- data.frame (namest = c("A","B", "C","D", "E", "F"),
position =c( 0, 10, 20, 25, 30, 35))
master
namest position
1 A 0
2 B 10
3 C 20
4 D 25
5 E 30
6 F 35
This is the bigger map (say, a road map), giving the name and position of each place. A second survey produced smaller subsets (many in practice, just 3 here).
subset1 <- data.frame (namest = c("I", "A", "ii", "iii", "B"),
position = c(0, 10, 12, 14, 20))
subset1
namest position
1 I 0
2 A 10
3 ii 12
4 iii 14
5 B 20
subset2 <- data.frame (namest = c("E", "vii", "F"), position = c(0, 3,5))
subset2
namest position
1 E 0
2 vii 3
3 F 5
subset3 <- data.frame (namest = c("D", "vi", "v", "C", "iv"),
position = c(0, 2, 3, 5, 8))
subset3
namest position
1 D 0
2 vi 2
3 v 3
4 C 5
5 iv 8
You can see that each subset has at least two names in common with master, for example D and C in subset3.
Now I want to combine these subsets to make a more detailed master, meaning that each new namest will be positioned in the new map. Note that some subsets (see subset3) are in reverse order compared to master.
Thus expected output is:
subsetalign <- data.frame(subsett = c(rep ("A-B", nrow(subset1)),
rep("C-D", nrow(subset3)),
rep("E-F", nrow(subset2))), namest = c(c("I", "A", "ii", "iii", "B"),
rev (c("D", "vi", "v", "C", "iv")),c("E", "vii", "F")),
position = c(subset1$position, rev (subset3$position), subset2$position))
subsetalign
subsett namest position
1 A-B I 0
2 A-B A 10
3 A-B ii 12
4 A-B iii 14
5 A-B B 20
6 C-D iv 8
7 C-D C 5
8 C-D v 3
9 C-D vi 2
10 C-D D 0
11 E-F E 0
12 E-F vii 3
13 E-F F 5
The output process can be visualized as a figure (I do not mean to create such a figure at this point, just to explain better).
Edits:
It is not simply rbind, for two reasons:
(a) The subsets are ordered based on how their common namest are arranged in the master file.
For example subset1 (A-B) + subset3 (C-D) + subset2 (E-F), as the order in master is A-B-C-D-E-F
(b) Also, if a subset is in reverse order relative to master, it should be reversed.
In subset3, the order of namest is "D"-"vi"-"v"-"C"-"iv", but in master D comes after C, so subset3 should be reversed before binding.
Suppose the subsets are in a list
subsets <- list(subset1, subset2, subset3)
The location of the anchors in the master are
idx <- lapply(subsets, function(x, y) match(x$namest, y$namest), master)
The orientation of each subset is
orientation <- sapply(idx, function(elt) unique(diff(elt[!is.na(elt)])))
And the position in the master is
position <- sapply(idx, function(elt) min(elt, na.rm=TRUE))
The subsets can be ordered subsets[order(position)], reversed if necessary
updt <- Map(function(elt, dir) {
if (dir == -1)
elt[rev(seq_len(nrow(elt))),]
else elt
}, subsets[order(position)], orientation[order(position)])
and rbinded together, do.call(rbind, updt). This is assuming that all intervals in master are represented exactly once.
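The three steps (locate the anchors, detect the orientation, order and reverse) can be sketched end to end. The following Python version is only an illustration under the same assumption that every subset contains at least two names from master; the function name align and the list-of-tuples layout are hypothetical:

```python
master = ["A", "B", "C", "D", "E", "F"]
subsets = [
    [("I", 0), ("A", 10), ("ii", 12), ("iii", 14), ("B", 20)],
    [("E", 0), ("vii", 3), ("F", 5)],
    [("D", 0), ("vi", 2), ("v", 3), ("C", 5), ("iv", 8)],
]

def align(subsets, master):
    """Order subsets by their anchors in master, reversing those that run backwards."""
    rank = {name: i for i, name in enumerate(master)}
    placed = []
    for rows in subsets:
        # Positions in master of the names this subset shares with it.
        anchors = [rank[name] for name, _ in rows if name in rank]
        if anchors[0] > anchors[-1]:       # anchors run against the master order
            rows = rows[::-1]
        placed.append((min(anchors), rows))
    placed.sort(key=lambda item: item[0])  # earliest anchor in master comes first
    return [row for _, rows in placed for row in rows]

aligned = align(subsets, master)
```

For the example data this yields the rows of subset1, then subset3 reversed, then subset2, which is the order shown in the expected output.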
