How to subset a dataframe using a column from another dataframe in r?

How to subset a dataframe using a column from another dataframe in r? - r

I have 2 dataframes
Dataframe1:
| Cue | Ass_word | Condition | Freq | Cue_Ass_word |
1 | ACCENDERE | ACCENDINO | A | 1 | ACCENDERE_ACCENDINO
2 | ACCENDERE | ALLETTARE | A | 0 | ACCENDERE_ALLETTARE
3 | ACCENDERE | APRIRE | A | 1 | ACCENDERE_APRIRE
4 | ACCENDERE | ASCENDERE | A | 1 | ACCENDERE_ASCENDERE
5 | ACCENDERE | ATTIVARE | A | 0 | ACCENDERE_ATTIVARE
6 | ACCENDERE | AUTO | A | 0 | ACCENDERE_AUTO
7 | ACCENDERE | ACCENDINO | B | 2 | ACCENDERE_ACCENDINO
8 | ACCENDERE| ALLETTARE | B | 3 | ACCENDERE_ALLETTARE
9 | ACCENDERE| ACCENDINO | C | 2 | ACCENDERE_ACCENDINO
10 | ACCENDERE| ALLETTARE | C | 0 | ACCENDERE_ALLETTARE
Dataframe2:
| Group.1 | x
1 | ACCENDERE_ACCENDINO | 5
13 | ACCENDERE_FUOCO | 22
16 | ACCENDERE_LUCE | 10
24 | ACCENDERE_SIGARETTA | 6
....
I want to exclude from Dataframe1 all the rows that contain words (Cue_Ass_word) that are not reported in the column Group.1 in Dataframe2.
In other words, how can I subset Dataframe1 using the strings reported in Dataframe2$Group.1?

It's not quite clear what you mean, but is this what you need?
Dataframe1[!(Dataframe1$Cue_Ass_word %in% Dataframe2$Group1),]

Related

Relabel of rowname column in R dataframe

When I bind multiple dataframes together using Out2 = do.call(rbind.data.frame, Out), I obtain the following output. How do I relabel the first column such that it only contains the numbers within the square brackets, i.e. 1 to 5 for each trial number? Is there a way to add a column name to the first column too?
| V1 | V2 | Trial |
+--------+--------------+--------------+-------+
| [1,] | 0.130880519 | 0.02085533 | 1 |
| [2,] | 0.197243133 | -0.000502744 | 1 |
| [3,] | -0.045241653 | 0.106888902 | 1 |
| [4,] | 0.328759949 | -0.106559163 | 1 |
| [5,] | 0.040894969 | 0.114073454 | 1 |
| [1,]1 | 0.103130056 | 0.013655756 | 2 |
| [2,]1 | 0.133080106 | 0.038049071 | 2 |
| [3,]1 | 0.067975054 | 0.03036033 | 2 |
| [4,]1 | 0.132437217 | 0.022887103 | 2 |
| [5,]1 | 0.124950463 | 0.007144698 | 2 |
| [1,]2 | 0.202996317 | 0.004181205 | 3 |
| [2,]2 | 0.025401354 | 0.045672932 | 3 |
| [3,]2 | 0.169469266 | 0.002551237 | 3 |
| [4,]2 | 0.2303046 | 0.004936579 | 3 |
| [5,]2 | 0.085702254 | 0.020814191 | 3 |
+--------+--------------+--------------+-------+

We can use parse_number to extract the first occurence of numbers
library(dplyr)
df1 %>%
mutate(newcol = readr::parse_number(row.names(df1)))
Or in base R, use sub to capture the digits after the [ in the row names
df1$newcol <- sub("^\\[(\\d+).*", "\\1", row.names(df1))

How do you assign groups to larger groups dpylr

I would like to assign groups to larger groups in order to assign them to cores for processing. I have 16 cores.This is what I have so far
test<-data_extract%>%group_by(group_id)%>%sample_n(16,replace = TRUE)
This takes staples OF 16 from each group.
This is an example of what I would like the final product to look like (with two clusters),all I really want is for the same group id to belong to the same cluster as a set number of clusters
________________________________
balance | group_id | cluster|
454452 | a | 1 |
5450441 | a | 1 |
5444531 | b | 1 |
5404051 | b | 1 |
5404501 | b | 1 |
5404041 | b | 1 |
544251 | b | 1 |
254252 | b | 1 |
541254 | c | 2 |
54123254 | d | 1 |
542541 | d | 1 |
5442341 | e | 2 |
541 | f | 1 |
________________________________

test<-data%>%group_by(group_id)%>% mutate(group = sample(1:16,1))

How to remove empty cells and reduce columns

I have a table, that looks roughly like this:
| variable | observer1 | observer2 | observer3 | final |
| -------- | --------- | --------- | --------- | ----- |
| case1 | | | | |
| var1 | 1 | 1 | | |
| var2 | 3 | 3 | | |
| var3 | 4 | 5 | | 5 |
| case2 | | | | |
| var1 | 2 | | 2 | |
| var2 | 5 | | 5 | |
| var3 | 1 | | 1 | |
| case3 | | | | |
| var1 | | 2 | 3 | 2 |
| var2 | | 2 | 2 | |
| var3 | | 1 | 1 | |
| case4 | | | | |
| var1 | 1 | | 1 | |
| var2 | 5 | | 5 | |
| var3 | 3 | | 3 | |
Three colums for the observers, but only two are filled.
First I want to compute the IRR, so I need a table that has two columns without the empty cells like this:
| variable | observer1 | observer2 |
| -------- | --------- | --------- |
| case1 | | |
| var1 | 1 | 1 |
| var2 | 3 | 3 |
| var3 | 4 | 5 |
| case2 | | |
| var1 | 2 | 2 |
| var2 | 5 | 5 |
| var3 | 1 | 1 |
| case3 | | |
| var1 | 2 | 3 |
| var2 | 2 | 2 |
| var3 | 1 | 1 |
| case4 | | |
| var1 | 1 | 1 |
| var2 | 5 | 5 |
| var3 | 3 | 3 |
I try to use the tidyverse packages, but I'm not sure. Some 'ifelse()' magic may be easier.
Is there a clean and easy method to do something like this? Can anybody point me to the right function to use? Or just to a keyword to search for on stackoverflow? I found a lot of methods to remove whole empty columns or rows.
Edit: I removed the link to the original data. It was unnecessary. Thanks to Lamia for his working answer.

Out of your 3 columns observer1, observer2 and observer3, you sometimes have 2 non-NA values, 1 non-NA value, or 3 NA values.
If you want to merge your 3 columns, you could do:
res = data.frame(df$coding,t(apply(df[paste0("observer",1:3)],1,function(x) x[!is.na(x)][1:2])))
The apply function will return for each row the 2 non-NA values if there are 2, one non-NA value and one NA if there is only one value, and two NAs if there is no data in the row.
We then put this result in a dataframe with the first column (coding).

Split data based on grouping column

I'm trying to work out how, in Azure ML (and therefore R solutions are acceptable), to randomly split data based on a column, such that all records with any given value in that column wind up in one side of the split or another. For example:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 5678 | 0 | Bar | 1 |
| 9101112 | 1 | Quack | 1 |
| 13141516 | 1 | Meep | 1 |
| 1234 | 0 | Boop | 2 |
| 5678 | 0 | Baa | 2 |
| 9101112 | 0 | Bleat | 2 |
| 13141516 | 1 | Maaaa | 2 |
| 1234 | 0 | Foo | 3 |
| 5678 | 0 | Bar | 3 |
| 9101112 | 1 | Quack | 3 |
| 13141516 | 1 | Meep | 3 |
| 1234 | 1 | Boop | 4 |
| 5678 | 1 | Baa | 4 |
| 9101112 | 0 | Bleat | 4 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
Acceptable output from that if I chose, say, a 50/50 split and to be grouped based on the Student ID column would be two new datasets:
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 1234 | 1 | Foo | 1 |
| 1234 | 0 | Boop | 2 |
| 1234 | 0 | Foo | 3 |
| 1234 | 1 | Boop | 4 |
| 9101112 | 1 | Quack | 1 |
| 9101112 | 0 | Bleat | 2 |
| 9101112 | 1 | Quack | 3 |
| 9101112 | 0 | Bleat | 4 |
+------------+------+--------------------+------+
and
+------------+------+--------------------+------+
| Student ID | pass | some_other_feature | week |
+------------+------+--------------------+------+
| 5678 | 0 | Bar | 1 |
| 5678 | 0 | Baa | 2 |
| 5678 | 0 | Bar | 3 |
| 5678 | 1 | Baa | 4 |
| 13141516 | 1 | Meep | 1 |
| 13141516 | 1 | Maaaa | 2 |
| 13141516 | 1 | Meep | 3 |
| 13141516 | 1 | Maaaa | 4 |
+------------+------+--------------------+------+
Now, from what I can tell this is basically the opposite of stratified split, where it would get a random sample with every student represented on both sides.
I would prefer an Azure ML function that did this, but I think that's unlikely so is there an R function or library that gives this kind of functionality? All I could find was questions about stratification which obviously don't help me much.

You can use te following command:
data.fold <- mutate(df, fold = sample(rep_len(1:2, n_distinct(Student ID)))[Student ID])
It returns the original dataframe with an new column that indicates the fold that the student is in. If you want more folds, just adjust the '1:2' part.
I've tried the 'sample unique' way but it did not always work for me in the past.

Is there an sqlite function that can check if a field matches a certain value and return 0 or 1?

Consider the following sqlite3 table:
+------+------+
| col1 | col2 |
+------+------+
| 1 | 200 |
| 1 | 200 |
| 1 | 100 |
| 1 | 200 |
| 2 | 400 |
| 2 | 200 |
| 2 | 100 |
| 3 | 200 |
| 3 | 200 |
| 3 | 100 |
+------+------+
I'm trying to write a query that will select the entire table and return 1 if the value in col2 is 200, and 0 otherwise. For example:
+------+--------------------+
| col1 | SOMEFUNCTION(col2) |
+------+--------------------+
| 1 | 1 |
| 1 | 1 |
| 1 | 0 |
| 1 | 1 |
| 2 | 0 |
| 2 | 1 |
| 2 | 0 |
| 3 | 1 |
| 3 | 1 |
| 3 | 0 |
+------+--------------------+
What is SOMEFUNCTION()?
Thanks in advance...

In SQLite, boolean values are just integer values 0 and 1, so you can use the comparison directly:
SELECT col1, col2 = 200 AS SomeFunction FROM MyTable

Like described in Does sqlite support any kind of IF(condition) statement in a select you can use the case keyword.
SELECT col1,CASE WHEN col2=200 THEN 1 ELSE 0 END AS col2 FROM table1

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex

How to subset a dataframe using a column from another dataframe in r? - r

It's not quite clear what you mean, but is this what you need? Dataframe1[!(Dataframe1$Cue_Ass_word %in% Dataframe2$Group1),]

Related

Relabel of rowname column in R dataframe

How do you assign groups to larger groups dpylr

How to remove empty cells and reduce columns

Split data based on grouping column

Is there an sqlite function that can check if a field matches a certain value and return 0 or 1?

Categories

Resources