Data transformation for machine learning - r

I have a dataset with SKU IDs and their counts. I need to feed this data into a machine learning algorithm in such a way that the SKU IDs become columns and the COUNTs sit at the intersection of transaction ID and SKU ID. Can anyone suggest how to achieve this transformation?
CURRENT DATA
TransID SKUID COUNT
1 31 1
1 32 2
1 33 1
2 31 2
2 34 -1
DESIRED DATA
TransID 31 32 33 34
1 1 2 1 0
2 2 0 0 -1

In R, we can use either xtabs
xtabs(COUNT~., df1)
# SKUID
#TransID 31 32 33 34
# 1 1 2 1 0
# 2 2 0 0 -1
Or dcast
library(reshape2)
dcast(df1, TransID~SKUID, value.var="COUNT", fill=0)
# TransID 31 32 33 34
#1 1 1 2 1 0
#2 2 2 0 0 -1
Or spread
library(tidyr)
spread(df1, SKUID, COUNT, fill=0)
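The answers above assume a data frame called df1 that is never shown. The following is a minimal base R sketch that rebuilds it from the question's sample rows (the construction of df1 is an assumption about how the data is stored) and converts the xtabs result into a plain data frame, which is often the more convenient input for a modelling function:

```r
# Sample data copied from the question; the name df1 follows the answer above.
df1 <- data.frame(TransID = c(1, 1, 1, 2, 2),
                  SKUID   = c(31, 32, 33, 31, 34),
                  COUNT   = c(1, 2, 1, 2, -1))

# Cross-tabulate: one row per transaction, one column per SKU, summing COUNT.
wide <- xtabs(COUNT ~ TransID + SKUID, data = df1)

# Turn the contingency table into an ordinary data frame.
# Row names hold the TransID values, column names hold the SKU IDs.
wide_df <- as.data.frame.matrix(wide)
print(wide_df)
```

Combinations that never occur in the data (such as transaction 1 with SKU 34) come out as 0, because xtabs sums over an empty set of rows.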

In Pandas, you can use pivot:
>>> df.pivot('TransID', 'SKUID').fillna(0)
COUNT
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1
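To avoid ambiguity, it is best to explicitly label your variables (recent pandas releases accept the pivot arguments only as keywords, so this form is also the more portable one):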
df.pivot(index='TransID', columns='SKUID').fillna(0)
You can also perform a groupby and then unstack SKUID:
>>> df.groupby(['TransID', 'SKUID']).COUNT.sum().unstack('SKUID').fillna(0)
SKUID 31 32 33 34
TransID
1 1 2 1 0
2 2 0 0 -1

In GraphLab/SFrame, the relevant commands are unstack and unpack.
import sframe #or import graphlab
sf = sframe.SFrame({'TransID': [1, 1, 1, 2, 2],
                    'SKUID': [31, 32, 33, 31, 34],
                    'COUNT': [1, 2, 1, 2, -1]})
sf2 = sf.unstack(['SKUID', 'COUNT'], new_column_name='dict_counts')
out = sf2.unpack('dict_counts', column_name_prefix='')
The missing values can be filled by column:
for c in out.column_names():
    out[c] = out[c].fillna(0)
out.print_rows()
+---------+----+----+----+----+
| TransID | 31 | 32 | 33 | 34 |
+---------+----+----+----+----+
| 1 | 1 | 2 | 1 | 0 |
| 2 | 2 | 0 | 0 | -1 |
+---------+----+----+----+----+

Related

How to filter out all rows of data frame after the last value of 1 in the column Z?

I have the following data frame:
| Y | Z |
-----------------
62 0
65 0
59 1
66 0
64 1
64 1
57 0
68 1
59 0
60 0
How can I filter the data frame so that all the leftover rows after the final occurrence of the value 1 in column Z are removed (in this case, all the zeroes after the last 1)? For the example above, the filtered data frame would become:
| Y | Z |
-----------------
62 0
65 0
59 1
66 0
64 1
64 1
57 0
68 1
Also, how could I do the same for the rows before the first 1 (that is, filter out all the rows that precede it)?
You can delete all rows after the last occurrence of a value like this:
library(dplyr)
df %>%
slice(1:max(which(Z == 1)))
Output:
Y Z
1 62 0
2 65 0
3 59 1
4 66 0
5 64 1
6 64 1
7 57 0
8 68 1
Another possible solution, which drops both the zeros before the first 1 and the zeros after the last 1:
library(dplyr)
df %>%
filter(!(Z == 0 & data.table::rleid(Z) %>% "%in%"(c(1, max(.)))))
#> Y Z
#> 1 59 1
#> 2 66 0
#> 3 64 1
#> 4 64 1
#> 5 57 0
#> 6 68 1
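For readers without dplyr, both requests (dropping rows after the last 1, and dropping rows before the first 1) can be sketched in base R. This is only an illustration on the question's sample data. The names ones, up_to_last and from_first are made up for the example, and it assumes Z contains at least one 1:

```r
# Sample data copied from the question.
df <- data.frame(Y = c(62, 65, 59, 66, 64, 64, 57, 68, 59, 60),
                 Z = c(0, 0, 1, 0, 1, 1, 0, 1, 0, 0))

# Row positions where Z equals 1.
ones <- which(df$Z == 1)

# Assumption: at least one 1 exists; otherwise min()/max() would be undefined.
stopifnot(length(ones) > 0)

# Keep everything up to and including the last 1.
up_to_last <- df[seq_len(max(ones)), ]

# Keep everything from the first 1 onwards.
from_first <- df[min(ones):nrow(df), ]

print(up_to_last)
print(from_first)
```

Applying both index ranges at once, df[min(ones):max(ones), ], trims the data on both sides.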

Conditional Statements: selecting/assigning a variable per row

I have a data set with 2 VPs and 350 interval values for each. I am writing a loop with an if statement to detect when a minimum value of VP1 overlaps with the maximum value of VP2.
The data is usually sorted by VP, but I sorted it by minimum instead, since it is a timeframe.
I ran the following code, which assigns 0 or 1 depending on whether a value overlaps the previous item, but it does not account for what the previous item is (i.e., whether the previous item is VP1 or VP2).
for (i in 2:length(df$newvariable)) {
  if (df$minimum[i] < df$maximum[i-1]) {
    df$newvariable[i] <- 0
  } else {
    df$newvariable[i] <- 1
  }
}
I want to say if df$minimum[i] of VP1 < df$maximum[i] of VP2, then df$newvariable = 0. Otherwise, df$newvariable = 1.
I have not been able to find how to make it conditional per each row and loop again. Does anyone have any recommendations?
Many thanks.
Sample Data:
VP xmin xmax
1 0 6
2 0 2
2 6 14
1 14 24
2 20 30
1 30 36
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax newvariable
1 0 6 -
2 0 2 0
2 6 14 1
1 14 24 1
2 20 30 0
1 30 36 1
A follow-up: my data frame has another variable, talking, with the values 1 (yes) or 0 (no). I originally subsetted it to look only at the 0 rows and created new variables such as quiet_together. Now I want to put the subsets back together, but the separate data frames have extra columns. If I want exactly the same thing as described above, but on the combined data frame instead of two separate ones, how would I specify it for each value of the variable? I want to end up with two new columns based on the xmin and xmax values while accounting for the value of the talking variable. The new columns would be talk_together (for the 1 value of the talking variable) and quiet_together (for the 0 value of the talking variable), when xmin <= xmax for the previous line.
For example:
Sample Data:
VP xmin xmax talking
1 0 6 0
2 0 2 0
2 2 6 1
2 6 14 0
1 6 14 1
2 14 24 1
1 14 20 0
1 20 30 1
2 24 32 0
1 30 32 0
... And so on for 600 or so rows.
Desired Output:
VP xmin xmax talking talk_together quiet_together
1 0 6 0 0 0
2 0 2 0 0 0
2 2 6 1 0 0
2 6 14 0 0 0
1 6 14 1 0 0
1 14 20 0 0 0
2 14 24 1 1 0
1 20 30 1 1 0
2 24 32 0 0 1
1 30 32 0 0 1
You could use lag from dplyr to compare with previous xmax value.
library(dplyr)
df %>% mutate(newvariable = as.integer(xmin >= lag(xmax)))
# VP xmin xmax newvariable
#1 1 0 6 NA
#2 2 0 2 0
#3 2 6 14 1
#4 1 14 24 1
#5 2 20 30 0
#6 1 30 36 1
Or shift with data.table
library(data.table)
setDT(df)[, newvariable := +(xmin >= shift(xmax))]
Base R alternatives are:
df$newvariable <- as.integer(c(NA, df$xmin[-1] >= df$xmax[-nrow(df)]))
and
df$newvariable <- +c(NA, tail(df$xmin, -1) >= head(df$xmax, -1))
With data.table, we can do
library(data.table)
setDT(df)[, newvariable := as.integer(xmin >= shift(xmax))]
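As a self-contained check, the base R comparison can be run on the question's first sample table. The sketch below also records whether the previous row belongs to the other VP, since the question mentions that the original loop ignored this. The column prev_is_other_vp is a hypothetical helper, not part of any answer above, and how (or whether) to combine it with newvariable depends on the intended rule:

```r
# First sample table from the question.
df <- data.frame(VP   = c(1, 2, 2, 1, 2, 1),
                 xmin = c(0, 0, 6, 14, 20, 30),
                 xmax = c(6, 2, 14, 24, 30, 36))

# Shift xmax and VP down by one row; the first row has no predecessor (NA).
prev_xmax <- c(NA, head(df$xmax, -1))
prev_vp   <- c(NA, head(df$VP, -1))

# 1 when the interval starts at or after the previous interval's end, else 0.
df$newvariable <- as.integer(df$xmin >= prev_xmax)

# Hypothetical helper: TRUE when the previous row belongs to the other VP.
df$prev_is_other_vp <- prev_vp != df$VP

print(df)
```

The newvariable column matches the desired output in the question, with NA in place of the dash in the first row.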

R: apply/lapply: How to create a bar chart if all entries in one column are 1's?

Imagine you have the following data set:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 0 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
Further, imagine you want to compile summary tables that print out the frequencies of those that drink wine, beer, water.
I solved it this way:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
This allows me to complete my ultimate goal of compiling a bar chart in the way I want it:
barplot(con_P)
It works perfectly. No problem. Now, let us tweak the data set as follows: We set all entries for water to 1.
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
If I now run the following commands:
con<-apply(df[,c(2:4)], 2, table)
con_P<-prop.table(con,2)
it gives me the following error message after the second line: Error in margin.table(x, margin) : 'x' is not an array!
Through another question here on this forum, I learned that the following will help me to overcome this issue:
con_P <- lapply(con, function(x) x/sum(x))
However, if I now run
barplot(con_P)
R does not create a barplot: Error in -0.01 * height : non-numeric argument to binary operator. I assume this is because con_P is not an array.
My question is what to do now (how would I transform con_P in the second example into an array?). Secondly, how can I make the entire step of creating prop.tables and then a bar chart more efficient? Any help is much appreciated.
We can do this by converting the columns to factors with the levels specified. Because the columns contain only 0 and 1 values, we use 0:1 as the levels, then get the table, convert it to proportions with prop.table, and draw the barplot:
barplot(prop.table(sapply(df[2:4],
function(x) table(factor(x, levels=0:1))),2))
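To see why fixing the factor levels helps, here is a self-contained sketch on the second data set from the question. Once every column is tabulated over the same two levels, sapply can simplify the result to a matrix, which is what prop.table and barplot expect. The intermediate name counts is only illustrative:

```r
# Second data set from the question (Water is 1 in every row).
df <- read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female")

# Tabulate each drink over both levels, so a missing level is counted as 0.
counts <- sapply(df[2:4], function(x) table(factor(x, levels = 0:1)))

# counts is a 2 x 3 matrix; convert each column to proportions.
con_P <- prop.table(counts, 2)
print(con_P)

# Draw the chart only in an interactive session.
if (interactive()) barplot(con_P, legend.text = TRUE)
```

Without the explicit levels, the Water column yields a table of length one while the others have length two, so apply or sapply falls back to a list and prop.table fails.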
Reproducing your data:
df<-data.frame(read.table(header = TRUE, text = "
ID Wine Beer Water Age Gender
1 0 1 1 20 Male
2 1 0 1 38 Female
3 0 0 1 32 Female
4 1 0 1 30 Male
5 1 1 1 30 Male
6 1 1 1 26 Female
7 0 1 1 36 Female
8 0 1 1 29 Male
9 0 1 1 33 Female
10 0 1 1 20 Female"))
con <-lapply(df[,c(2:4)], table)
con_P <- lapply(con, function(x) x/sum(x))
You can use reshape2 to melt the data:
library(reshape2)
df <- melt(con_P)
Now, if you want to use ggplot2, you can plot the bar chart from df:
library(ggplot2)
ggplot(df, aes(x = L1, y = value, fill = factor(Var1))) +
  geom_bar(stat = "identity") +
  theme_bw()
If you want to use barplot you can reshape the data.frame into an array:
array <- acast( df, Var1~L1)
array[is.na(array)] <- 0
barplot(array)

R - Change row values based on the contents of neighbouring rows

I have a series of numbers in two columns, with the titles "a" and "b".
I want R to change a value in column "b" if the corresponding value in column "a" differs from a neighbouring value by more than 10.
For example:
a | b
-----------
1 | 1
2 | 1
3 | 1
4 | 1
21 | 1
22 | 1
23 | 1
24 | 1
... | ...
Then I would like R to change the values in column "b" to
a | b
-----------
1 | 1
2 | 1
3 | 1
4 | 0
21 | 0
22 | 1
23 | 1
24 | 1
... | ...
Because the values 4 and 21 in column "a" differ from each other by more than 10.
Any help would be greatly appreciated.
df <- data.frame(a = c(1:4, 21:24), b = 1)
# check whether differences are greater than 10
diffs <- diff(df$a) > 10
# create `b`
df$b <- as.integer(!(c(FALSE, diffs) | c(diffs, FALSE)))
The result:
a b
1 1 1
2 2 1
3 3 1
4 4 0
5 21 0
6 22 1
7 23 1
8 24 1
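An alternative:
df <- data.frame(a = c(1:4, 21:24), b = 1L)
local({
  # positions where the gap to the next value exceeds 10
  w10 <- with(df, which(diff(a) > 10))
  # zero out both sides of each gap in the outer df
  df$b[c(w10, w10 + 1)] <<- 0L
})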

Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on are not going to be very fast, even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10  # number of rows; matches the length-10 allele vectors below
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
                 age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
                 rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
                 rs157.1=c(4,4,4,2,4,4,4,4,2,2), rs132=c(4,4,4,4,4,4,4,4,2,2),
                 rs132.1=c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.
Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
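The answer above covers step 1 by hand for three SNPs. A possible way to generalise it, and to cover steps 2 and 3, is sketched below in base R. It assumes that the second allele of every SNP is stored in a column named "<snp>.1", as in the toy data. The names geno, snps, combined, recode_rare and recoded are made up for this illustration, and it has not been timed on data of the real size:

```r
# Genotype columns copied from the question (demographic columns omitted).
geno <- data.frame(
  rs123   = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1),
  rs123.1 = rep(1, 10),
  rs157   = c(2, 4, 2, 2, 2, 4, 4, 4, 2, 2),
  rs157.1 = c(4, 4, 4, 2, 4, 4, 4, 4, 2, 2),
  rs132   = c(4, 4, 4, 4, 4, 4, 4, 4, 2, 2),
  rs132.1 = c(4, 4, 4, 4, 4, 4, 4, 4, 4, 4)
)

# Assumption: columns NOT ending in ".1" name the SNPs (first alleles).
snps <- grep("\\.1$", names(geno), value = TRUE, invert = TRUE)

# Step 1: paste the two allele columns of each SNP into one genotype string.
combined <- sapply(snps, function(s) paste0(geno[[s]], geno[[paste0(s, ".1")]]))

# Steps 2 and 3: mark the least frequent genotype with 1, all others with 0.
# which.min() returns the first minimum, so ties are resolved arbitrarily.
recode_rare <- function(x) {
  counts <- table(x)
  as.integer(x == names(counts)[which.min(counts)])
}
recoded <- apply(combined, 2, recode_rare)

print(combined)
print(recoded)
```

For rs123 the least frequent genotype is 31, so those rows become 1 and the 11 rows become 0. For rs157 the genotypes 22 and 24 are tied, and this sketch happens to pick 22; a real analysis would need an explicit rule for ties.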

Resources