Working with keys in a pandas DataFrame

In R, setkey() can be used to work with keys; for example, my data.table gets sorted automatically when I use aggregation functions. The R command I use is:
setkey(myData, "Customer")
Does Python/pandas also work with keys, and is there an equivalent of this R command?
Thanks a lot.

R's data.table setkey() function, as far as I know, doesn't have a direct equivalent in Python. However, a few pandas functions cover the same ground. Note the inplace parameter that sort_values() and set_index() accept: unless you pass inplace=True, the underlying data is not changed, so you have to assign the result explicitly (e.g., `df = df.sort_values('a')`).
You can use the sort_values() function to sort your data on one or more columns.
import pandas as pd
df = pd.DataFrame({'a': [1,1,2,1,2,2,2],
                   'b': [1,1,0,2,4,1,5],
                   'c': [3,4,5,2,6,1,7]})
>>> df
   a  b  c
0  1  1  3
1  1  1  4
2  2  0  5
3  1  2  2
4  2  4  6
5  2  1  1
6  2  5  7
>>> df.sort_values(['a', 'b'])
   a  b  c
0  1  1  3
1  1  1  4
3  1  2  2
2  2  0  5
5  2  1  1
4  2  4  6
6  2  5  7
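To make the point about assignment concrete, here is a minimal sketch using the same frame. The names `sorted_df` and `original_order` are only illustrative and do not appear in the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 1, 2, 2, 2],
                   'b': [1, 1, 0, 2, 4, 1, 5],
                   'c': [3, 4, 5, 2, 6, 1, 7]})

# sort_values returns a new, sorted DataFrame; df itself is left untouched.
sorted_df = df.sort_values(['a', 'b'])
original_order = df.index.tolist()  # still 0..6, because df was not modified

# To keep the sorted result under the same name, assign it back.
df = df.sort_values(['a', 'b'])
```

Assigning the result back is usually easier to follow than inplace=True, since each step then produces a value you can inspect.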
If you are performing aggregation on a column or series of columns, you can use the groupby() function. This is similar to the by operator in R's data.table.
>>> df.groupby(['a', 'b'])['c'].max()
a  b
1  1    4
   2    2
2  0    5
   1    1
   4    6
   5    7
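As a rough illustration of why groupby() covers the "sorted automatically" part of the question: group keys come back sorted by default. The sketch below reuses the same frame; `max_c` and `max_c_flat` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 1, 2, 2, 2],
                   'b': [1, 1, 0, 2, 4, 1, 5],
                   'c': [3, 4, 5, 2, 6, 1, 7]})

# Group keys are returned in sorted order by default (sort=True),
# regardless of the row order in df.
max_c = df.groupby(['a', 'b'])['c'].max()

# as_index=False keeps the keys as ordinary columns instead of a MultiIndex.
max_c_flat = df.groupby(['a', 'b'], as_index=False)['c'].max()
```

If the key order does not matter, passing sort=False to groupby() skips the sorting step.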
You can also set the index to be one or more columns using the set_index() function.
>>> df.set_index('a')
   b  c
a
1  1  3
1  1  4
2  0  5
1  2  2
2  4  6
2  1  1
2  5  7
# Once the index is set, you can reference rows by the new index.
# Use .loc for label-based lookup; the older .ix accessor was removed in pandas 1.0.
>>> df.set_index('a', inplace=True)
>>> df.loc[1]
   b  c
a
1  1  3
1  1  4
1  2  2
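Putting the pieces together, the closest approximation to setkey() is probably to set the index and sort on it. This is a sketch under that assumption rather than an exact equivalent; `keyed` and `rows_for_1` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 1, 2, 2, 2],
                   'b': [1, 1, 0, 2, 4, 1, 5],
                   'c': [3, 4, 5, 2, 6, 1, 7]})

# Use 'a' as the index and sort on it, roughly what setkey does to a data.table.
keyed = df.set_index('a').sort_index()

# Label-based lookup on the index: all rows whose key equals 1.
rows_for_1 = keyed.loc[1]
```

Unlike a data.table key, a pandas index is not kept sorted automatically, so sort_index() has to be called again after rows are appended.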

Related

Data merge with data.table for repeating unique values

I am trying to merge two columns in data.table 'A' with a column in another data.table 'B' that holds the unique values of a variable. I want to merge in such a way that, for every unique combination of the two variables in 'A', all unique values of the column in 'B' are repeated.
I tried merge, but it doesn't give me all the values. I also tried the automatic recycling in data.table, but this doesn't give me the result either.
Input:
data.table A
X Y
1 1
1 2
1 3
2 1
3 1
4 4
4 5
5 6
data.table B
Z
1
2
Expected output
X Y Z
1 1 1
1 1 2
1 2 1
1 2 2
1 3 1
1 3 2
2 1 1
2 1 2
3 1 1
3 1 2
4 4 1
4 4 2
4 5 1
4 5 2
5 6 1
5 6 2
We can make use of crossing from tidyr
library(tidyr)
crossing(A, B)
# X Y Z
#1 1 1 1
#2 1 1 2
#3 1 2 1
#4 1 2 2
#5 1 3 1
#6 1 3 2
#7 2 1 1
#8 2 1 2
#9 3 1 1
#10 3 1 2
#11 4 4 1
#12 4 4 2
#13 4 5 1
#14 4 5 2
#15 5 6 1
#16 5 6 2
Or with merge from base R, but the order will be slightly different
merge(A, B)
To get the expected row order, pass the arguments in reverse order and then reorder the columns
merge(B, A)[c(names(A), names(B))]
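For anyone doing the same thing in pandas, a cross join gives the same Cartesian product. This is a sketch that assumes pandas 1.2 or later, where how='cross' is available; the frames mirror the input tables from the question.

```python
import pandas as pd

A = pd.DataFrame({'X': [1, 1, 1, 2, 3, 4, 4, 5],
                  'Y': [1, 2, 3, 1, 1, 4, 5, 6]})
B = pd.DataFrame({'Z': [1, 2]})

# how='cross' pairs every row of A with every row of B (8 x 2 = 16 rows).
out = A.merge(B, how='cross')
```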

Find minimal value for a multiple same keys in table [duplicate]

This question already has answers here:
Extract row corresponding to minimum value of a variable by group
I have a table that contains multiple rows with different data for the same key, where the key spans several columns.
The table looks like this:
  A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2
I have already worked out how to remove fully duplicated rows using unique() on multiple columns, so data duplication is not a problem.
For every key (columns A and B in the example), I would like to keep only the minimum value of the third column (column C).
At the end, the table should look like this:
  A B C
1 1 1 2
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
Thanks for any help; it is really appreciated.
If anything is unclear, feel free to ask.
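# Read the example data; the first field of each row becomes the row name.
con <- textConnection(" A B C
1 1 1 2
2 1 1 3
3 2 1 4
4 1 2 4
5 2 2 3
6 2 3 1
7 2 3 2
8 2 3 2")
df <- read.table(con, header = TRUE)
# Sort by key and value so the smallest C comes first within each (A, B) pair.
# The sorted result must be assigned back, otherwise the sort is discarded.
df <- df[with(df, order(A, B, C)), ]
# Keep the first row of each (A, B) pair, which now holds the minimum C.
df[!duplicated(df[1:2]), ]
#   A B C
# 1 1 1 2
# 4 1 2 4
# 3 2 1 4
# 5 2 2 3
# 6 2 3 1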
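The same sort-then-deduplicate idea carries over to pandas, and a grouped minimum gives the same rows more directly. The sketch below rebuilds the example table by hand; `by_sort` and `by_group` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 1, 2, 2, 2, 2],
                   'B': [1, 1, 1, 2, 2, 3, 3, 3],
                   'C': [2, 3, 4, 4, 3, 1, 2, 2]})

# Sort so the smallest C comes first within each key,
# then keep only the first row per (A, B) pair.
by_sort = df.sort_values(['A', 'B', 'C']).drop_duplicates(['A', 'B'])

# Alternative: aggregate directly and take the minimum of C per key.
by_group = df.groupby(['A', 'B'], as_index=False)['C'].min()
```

The first form keeps any other columns of the winning row, while the grouped minimum returns only the key columns and C.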

Create new variable based on the value of several other variables

So I have a data set with multiple variables that I want to use to create a new variable. I have seen other questions like this that use the ifelse statement, but that would be extremely inefficient here, since the new variable is based on 32 other variables. The variables are coded with values of 1, 2, 3, or NA, and I want the new variable to be coded as 1 if 2 or more of the 32 variables take the value 1, and 2 otherwise. Here is a small example of what I have been trying to do.
df <- data.frame(id = 1:10, v1 = c(1,2,2,2,3,NA,2,2,2,2), v2 = c(2,2,2,2,2,1,2,1,2,2),
v3 = c(1,2,2,2,2,3,2,2,2,2), v4 = c(2,2,2,2,2,1,2,2,2,3))
and the result I am looking for is this:
   id v1 v2 v3 v4 new
1   1  1  2  1  2   1
2   2  2  2  2  2   2
3   3  2  2  2  2   2
4   4  2  2  2  2   2
5   5  3  2  2  2   2
6   6 NA  1  3  1   1
7   7  2  2  2  2   2
8   8  2  1  2  2   2
9   9  2  2  2  2   2
10 10  2  2  2  3   2
I have also tried using rowSums within the ifelse statement, but with the missing values this doesn't work for all observations unless I recode the NAs to another value, which I want to avoid. Besides that, I feel there must be a much more efficient way of doing this.
It seems likely that this question has been answered before, but I couldn't find anything on it, so help or a pointer to a previous answer would be appreciated.
It looks like you were very close to getting your desired output, but you were probably missing the na.rm = TRUE argument as part of your rowSums() call. This will remove any NAs before rowSums does its calculations.
Anyway, using your data frame from above, I created a new variable that counts the number of times 1 appears across the variables, while ignoring NA values. Note that I've subsetted the data to exclude the id column:
df$count <- rowSums(df[-1] == 1, na.rm = TRUE)
Then I created another variable using an ifelse statement that returns a 1 if the count is 2 or more or a 2 otherwise.
df$var <- ifelse(df$count >= 2, 1, 2)
The returned output:
   id v1 v2 v3 v4 count var
1   1  1  2  1  2     2   1
2   2  2  2  2  2     0   2
3   3  2  2  2  2     0   2
4   4  2  2  2  2     0   2
5   5  3  2  2  2     0   2
6   6 NA  1  3  1     2   1
7   7  2  2  2  2     0   2
8   8  2  1  2  2     1   2
9   9  2  2  2  2     0   2
10 10  2  2  2  3     0   2
UPDATE / EDIT: As mentioned by Gregor in the comments, you can also just wrap the rowSums function in the ifelse statement for one line of code.
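For comparison, the same counting approach can be sketched in pandas, where a comparison against a missing value is simply False, so no NA handling is needed. The frame below mirrors the example data; `count` and the column name `new` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': range(1, 11),
                   'v1': [1, 2, 2, 2, 3, np.nan, 2, 2, 2, 2],
                   'v2': [2, 2, 2, 2, 2, 1, 2, 1, 2, 2],
                   'v3': [1, 2, 2, 2, 2, 3, 2, 2, 2, 2],
                   'v4': [2, 2, 2, 2, 2, 1, 2, 2, 2, 3]})

# NaN == 1 evaluates to False, so missing values simply do not count.
count = df.drop(columns='id').eq(1).sum(axis=1)

# 1 if at least two variables equal 1, otherwise 2.
df['new'] = np.where(count >= 2, 1, 2)
```

Computing `count` before adding the new column matters: counting after the column exists would also count the 1s in the result itself.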

Generating large drawing lists in R

Say I have a list in R like so,
[1] 3 5 4 7
And I want to generate all "drawings" from this list, from 1 up to the value of each number. For example,
1 1 1 1
1 1 1 2
1 1 1 3
...
2 3 3 1
2 3 3 2
2 3 3 3
...
3 5 4 7
I know I have used rep() in the past to do something very similar, which works for lists of 2 or 3 numbers (i.e. something like 1 4 5), but I'm not sure how to generalize this here.
Thoughts?
As suggested in the comments, use the Map function to apply seq to each element of your vector, then use expand.grid to generate a data.frame with the Cartesian product of the resulting sequences:
head(expand.grid(Map(seq,c(3,5,4,7))))
  Var1 Var2 Var3 Var4
1    1    1    1    1
2    2    1    1    1
3    3    1    1    1
4    1    2    1    1
5    2    2    1    1
6    3    2    1    1
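The same Cartesian product can be sketched in Python with itertools.product from the standard library. The names `limits` and `drawings` are illustrative.

```python
from itertools import product

limits = [3, 5, 4, 7]

# One range 1..n per position; product yields their Cartesian product.
drawings = list(product(*(range(1, n + 1) for n in limits)))
```

Note that product varies the last position fastest, which matches the listing in the question, whereas expand.grid varies the first column fastest. Both yield the same 3 * 5 * 4 * 7 = 420 combinations.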

Creating a combination data.table in R

I would like to do the following:
A B
1 2
1 3
1 4
2 3
2 4
3 4
using data.table, but I am not sure how to exclude the already used numbers cumulatively.
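Assuming the goal is every unordered pair of the numbers 1 to 4, with each pair listed once and A < B, the target table is the set of 2-element combinations. This is not a data.table solution, only a Python sketch of the logic; `pairs` is an illustrative name.

```python
from itertools import combinations

# All unordered pairs from 1..4, each listed once with the smaller value first.
pairs = list(combinations(range(1, 5), 2))
```

Each tuple in `pairs` corresponds to one (A, B) row of the desired table.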
