Relabel of rowname column in R dataframe - r

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+

We can use parse_number to extract the first occurence of numbers
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))

Related

How to remove empty cells and reduce columns

I have a table, that looks roughly like this:
| variable | observer1 | observer2 | observer3 | final |
| -------- | --------- | --------- | --------- | ----- |
| case1 | | | | |
| var1 | 1 | 1 | | |
| var2 | 3 | 3 | | |
| var3 | 4 | 5 | | 5 |
| case2 | | | | |
| var1 | 2 | | 2 | |
| var2 | 5 | | 5 | |
| var3 | 1 | | 1 | |
| case3 | | | | |
| var1 | | 2 | 3 | 2 |
| var2 | | 2 | 2 | |
| var3 | | 1 | 1 | |
| case4 | | | | |
| var1 | 1 | | 1 | |
| var2 | 5 | | 5 | |
| var3 | 3 | | 3 | |
Three colums for the observers, but only two are filled.
First I want to compute the IRR, so I need a table that has two columns without the empty cells like this:
| variable | observer1 | observer2 |
| -------- | --------- | --------- |
| case1 | | |
| var1 | 1 | 1 |
| var2 | 3 | 3 |
| var3 | 4 | 5 |
| case2 | | |
| var1 | 2 | 2 |
| var2 | 5 | 5 |
| var3 | 1 | 1 |
| case3 | | |
| var1 | 2 | 3 |
| var2 | 2 | 2 |
| var3 | 1 | 1 |
| case4 | | |
| var1 | 1 | 1 |
| var2 | 5 | 5 |
| var3 | 3 | 3 |
I try to use the tidyverse packages, but I'm not sure. Some 'ifelse()' magic may be easier.
Is there a clean and easy method to do something like this? Can anybody point me to the right function to use? Or just to a keyword to search for on stackoverflow? I found a lot of methods to remove whole empty columns or rows.
Edit: I removed the link to the original data. It was unnecessary. Thanks to Lamia for his working answer.
Out of your 3 columns observer1, observer2 and observer3, you sometimes have 2 non-NA values, 1 non-NA value, or 3 NA values.
If you want to merge your 3 columns, you could do:
res = data.frame(df$coding,t(apply(df[paste0("observer",1:3)],1,function(x) x[!is.na(x)][1:2])))
The apply function will return for each row the 2 non-NA values if there are 2, one non-NA value and one NA if there is only one value, and two NAs if there is no data in the row.
We then put this result in a dataframe with the first column (coding).

Split data based on grouping column

I'm trying to work out how, in Azure ML (and therefore R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column wind up in one side of the split or another. For example:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 5678 | 0 | Bar | 1 |
| 9101112 | 1 | Quack | 1 |
| 13141516 | 1 | Meep | 1 |
| 1234 | 0 | Boop | 2 |
| 5678 | 0 | Baa | 2 |
| 9101112 | 0 | Bleat | 2 |
| 13141516 | 1 | Maaaa | 2 |
| 1234 | 0 | Foo | 3 |
| 5678 | 0 | Bar | 3 |
| 9101112 | 1 | Quack | 3 |
| 13141516 | 1 | Meep | 3 |
| 1234 | 1 | Boop | 4 |
| 5678 | 1 | Baa | 4 |
| 9101112 | 0 | Bleat | 4 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
Acceptable output from that if I chose, say, a 50/50 split and to be grouped based on the Student ID column would be two new datasets:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 1234 | 0 | Boop | 2 |
| 1234 | 0 | Foo | 3 |
| 1234 | 1 | Boop | 4 |
| 9101112 | 1 | Quack | 1 |
| 9101112 | 0 | Bleat | 2 |
| 9101112 | 1 | Quack | 3 |
| 9101112 | 0 | Bleat | 4 |
+------------+------+--------------------+------+
and
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 5678 | 0 | Bar | 1 |
| 5678 | 0 | Baa | 2 |
| 5678 | 0 | Bar | 3 |
| 5678 | 1 | Baa | 4 |
| 13141516 | 1 | Meep | 1 |
| 13141516 | 1 | Maaaa | 2 |
| 13141516 | 1 | Meep | 3 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
Now, from what I can tell this is basically the opposite of stratified split, where it would get a random sample with every student represented on both sides.
I would prefer an Azure ML function that did this, but I think that's unlikely so is there an R function or library that gives this kind of functionality? All I could find was questions about stratification which obviously don't help me much.
You can use te following command:
data.fold <- mutate(df, fold = sample(rep_len(1:2, n_distinct(Student ID)))[Student ID])
It returns the original dataframe with an new column that indicates the fold that the student is in. If you want more folds, just adjust the '1:2' part.
I've tried the 'sample unique' way but it did not always work for me in the past.

How do I analyze Market Basket Output?

I have a sale data as below:
+------------+------+-------+
| Receipt ID | Item | Value |
+------------+------+-------+
| 1 | a | 2 |
| 1 | b | 3 |
| 1 | c | 2 |
| 1 | k | 4 |
| 2 | a | 2 |
| 2 | b | 5 |
| 2 | d | 6 |
| 2 | k | 7 |
| 3 | a | 8 |
| 3 | k | 1 |
| 3 | c | 2 |
| 3 | q | 3 |
| 4 | k | 4 |
| 4 | a | 5 |
| 5 | b | 6 |
| 5 | a | 7 |
| 6 | a | 8 |
| 6 | b | 3 |
| 6 | c | 4 |
+------------+------+-------+
Using APriori algorithm, I modified the Rules into different columns:
For eg, I got output as below, I trimmed support, confidence, Lift value.. I am only considering rules which mapped into different columns into Target Item, Item1, Items ({Item1,Item2} -> {Target Item})
Output is as below:
+-------------+-------+-------+
| Target Item | Item1 | Item2 |
+-------------+-------+-------+
| a | b | |
| a | b | c |
| a | k | |
+-------------+-------+-------+
I am looking to calculate the all the receipts having the rules combination and identify the Target item Sale value only in those receipts and also Combined sale value of Item 1 and Item 2 in the combination receipts:
Output should be something like below (I dont need receipt ID's from below)
+-------------+-------+-------+--------------+----------------------+------------------------------+
| Target Item | Item1 | Item2 | Receipt ID's | Value of Target Item | Remaining value(Item1+item2) |
+-------------+-------+-------+--------------+----------------------+------------------------------+
| a | b | | 1,2,5,6 | 2+2+7+8 | 3+5+6+3 |
| a | b | c | 1,6 | 2 | (3+3) + (2+4) |
| a | k | | 1,2,3,4 | 2+2+8+5 | 4+7+1+4 |
+-------------+-------+-------+--------------+----------------------+------------------------------+
To replicate the Apriori:
library(arules)
Data <- data.frame(
Receipt_ID = c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,5,5,6,6,6),
item = c('a','b','c','k','a','b','d','k','a','k','c','q','k', 'a','b','a','a', 'b', 'c'
)
,
value = c(2,3,2,4,2,5,6,7,8,1,2,3,4,5,6,7,8,3,4
)
)
write.table(Data,"item.csv",sep=',',row.names = F)
data_frame = read.transactions(
file = "item.csv",
format = "single",
sep = ",",
cols = c("Receipt_ID","item"),
rm.duplicates = T
)
rules_apriori <- apriori(data_frame)
rules_apriori
rules_tab <- as(rules_apriori, "data.frame")
rules_tab
out <- strsplit(as.character(rules_tab$rules),'=>')
rules_tab$rhs <- do.call(rbind, out)[,2]
rules_tab$lhs <- do.call(rbind, out)[,1]
rules_tab$rhs <- gsub("\\{", "", rules_tab$rhs)
rules_tab$rhs <- gsub("}", "", rules_tab$rhs)
rules_tab$lhs = gsub("}", "", rules_tab$lhs)
rules_tab$lhs = gsub("\\{", "", rules_tab$lhs)
rules_final <- data.frame (target_item = character(),item_combination = character() )
rules_final <- cbind(target_item = rules_tab$rhs,item_Combination = rules_tab$lhs)
rules_final

How to subset a dataframe using a column from another dataframe in r?

I have 2 dataframes
Dataframe1:
| Cue | Ass_word | Condition | Freq | Cue_Ass_word |
1 | ACCENDERE | ACCENDINO | A | 1 | ACCENDERE_ACCENDINO
2 | ACCENDERE | ALLETTARE | A | 0 | ACCENDERE_ALLETTARE
3 | ACCENDERE | APRIRE | A | 1 | ACCENDERE_APRIRE
4 | ACCENDERE | ASCENDERE | A | 1 | ACCENDERE_ASCENDERE
5 | ACCENDERE | ATTIVARE | A | 0 | ACCENDERE_ATTIVARE
6 | ACCENDERE | AUTO | A | 0 | ACCENDERE_AUTO
7 | ACCENDERE | ACCENDINO | B | 2 | ACCENDERE_ACCENDINO
8 | ACCENDERE| ALLETTARE | B | 3 | ACCENDERE_ALLETTARE
9 | ACCENDERE| ACCENDINO | C | 2 | ACCENDERE_ACCENDINO
10 | ACCENDERE| ALLETTARE | C | 0 | ACCENDERE_ALLETTARE
Dataframe2:
| Group.1 | x
1 | ACCENDERE_ACCENDINO | 5
13 | ACCENDERE_FUOCO | 22
16 | ACCENDERE_LUCE | 10
24 | ACCENDERE_SIGARETTA | 6
....
I want to exclude from Dataframe1 all the rows that contain words (Cue_Ass_word) that are not reported in the column Group.1 in Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?
It's not quite clear what you mean, but is this what you need?
Dataframe1[!(Dataframe1$Cue_Ass_word %in% Dataframe2$Group1),]

Extracting columns from text file

I load a text file (tree.txt) to R, with the below content (copy pasted from JWEKA - J48 command).
I use the following command to load the text file:
data3 <-read.table (file.choose(), header = FALSE,sep = ",")
I would like to insert each column into a separate variables named like the following format COL1, COL2 ... COL8 (in this example since we have 8 columns). If you load it to EXCEL with delimited separation each row will be separated in one column (this is the required result).
Each COLn will contain the relevant characters of the tree in this example.
How can separate and insert the text file into these columns automatically while ignoring the header and footer content of the file?
Here is the text file content:
[[1]]
J48 pruned tree
------------------
MSTV <= 0.4
| MLTV <= 4.1: 3 -2
| MLTV > 4.1
| | ASTV <= 79
| | | b <= 1383:00:00 2 -18
| | | b > 1383
| | | | UC <= 05:00 1 -2
| | | | UC > 05:00 2 -2
| | ASTV > 79:00:00 3 -2
MSTV > 0.4
| DP <= 0
| | ALTV <= 09:00 1 (170.0/2.0)
| | ALTV > 9
| | | FM <= 7
| | | | LBE <= 142:00:00 1 (27.0/1.0)
| | | | LBE > 142
| | | | | AC <= 2
| | | | | | e <= 1058:00:00 1 -5
| | | | | | e > 1058
| | | | | | | DL <= 04:00 2 (9.0/1.0)
| | | | | | | DL > 04:00 1 -2
| | | | | AC > 02:00 1 -3
| | | FM > 07:00 2 -2
| DP > 0
| | DP <= 1
| | | UC <= 03:00 2 (4.0/1.0)
| | | UC > 3
| | | | MLTV <= 0.4: 3 -2
| | | | MLTV > 0.4: 1 -8
| | DP > 01:00 3 -8
Number of Leaves : 16
Size of the tree : 31
An example of the COL1 content will be:
MSTV
|
|
|
|
|
|
|
|
MSTV
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
COL2 content will be:
MLTV
MLTV
|
|
|
|
|
|
>
DP
|
|
|
|
|
|
|
|
|
|
|
|
DP
|
|
|
|
|
|
Try this:
cleaned.txt <- capture.output(cat(paste0(tail(head(readLines("FILE_LOCATION"), -4), -4), collapse = '\n'), sep = '\n'))
cleaned.df <- read.fwf(file = textConnection(cleaned.txt),
header = FALSE,
widths = rep.int(4, max(nchar(cleaned.txt)/4)),
strip.white= TRUE
)
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
For the cleaning process, I end up using a combination of head and tail to remove the 4 spaces on the top and the bottom. There's probably a more efficient way to do this outside of R, but this isn't so bad. Generally, I'm just making the file readable to R.
Your file looks like a fixed-width file so I use read.fwf, and use textConnection() to point the function to the cleaned output.
Finally, I'm not sure how your data is actually structured, but when I copied it from stackoverflow, it pasted with a bunch of whitespace at the end of each line. I'm using some tricks to guess at how long the file is, and removing extraneous columns over here
widths = rep.int(4, max(nchar(cleaned.txt)/4))
cleaned.df <- cleaned.df[,colSums(is.na(cleaned.df))<nrow(cleaned.df)]
Next, I'm creating the data in the way you would like it structured.
for (i in colnames(cleaned.df)) {
assign(i, subset(cleaned.df, select=i))
assign(i, capture.output(cat(paste0(unlist(get(i)[get(i)!=""])),sep = ' ', fill = FALSE)))
}
rm(i)
rm(cleaned.df)
rm(cleaned.txt)
What this does is it creates a loop for each column header in your data frame.
From there it uses assign() to put all the data in each column into its' own data frame. In your case, they are named V1 through V15.
Next, it uses a combination of cat() and paste() with unlist() an capture.output() to concatenate your list into a single character vectors, for each of the data frames, so they are now character vectors, instead of data frames.
Keep in mind that because you wanted a space at each new character, I'm using a space as a separator. But because this is a fixed-width file, some columns are completely blank, which I'm removing using
get(i)[get(i)!=""]
(Your question said you wanted COL2 to be: MLTV MLTV | | | | | | > DP | | | | | | | | | | | | DP | | | | | |).
If we just use get(i), there will be a leading whitespace in the output.

Resources