How to use multiple cell ranges as search criteria in the DGET function?

I am using the DGET function in LibreOffice. I have the first table shown below (top) and want to build the second table (bottom). I can use DGET where Database is the cell range containing the top table and Database Field is "Winner".
Is it possible to use different cell ranges as Search Criteria, so that each cell in the row for Case #1 can have a separate formula with a different search criterion, as given in the first row of the bottom table?
If I have to use separate contiguous cell ranges for the search criteria, there would be n × Chances such ranges, where n is the total number of cases (~150 in my case) and Chances is the number of possible Chance# values (50 in my case).
Case | Chance# | Winner
-------------------------
1 | 7 | Joe
1 | 9 | Emil
1 | 10 | Harry
1 | 11 | Kate
2 | 1 | Tom
2 | 3 | Jerry
2 | 4 | Mike
2 | 7 | John
Case |Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|
|="=1" |="=2" |="=3" |="=4" |="=5" |="=6" |="=7" |="=8" |="=9" |="=10" |="=11" | ---- |="=50"
1 | | | | | | | Joe | |Emil |Harry | Kate | ---- |
2 | Tom | |Jerry |Mike | | | John | | | | | ---- |

To do so you need to change your approach: instead of using DGET, I'm using a rather more involved method.
Considering your example:
A B C D
1 # Case Chance# Winner
2 1 1 7 Joe
3 2 1 9 Emil
4 3 1 10 Harry
5 4 1 11 Kate
6 5 2 1 Tom
7 6 2 3 Jerry
8 7 2 4 Mike
9 8 2 7 John
10
11 Case\Chance# 1 2 3 4
12 1
13 2 Tom Jerry Mike
I use the following:
=IF(SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))> 0,INDEX($D$2:$D$9,SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))),"")
Let's ignore the IF for a moment and focus on the real deal here:
First, get the row that matches your conditions: $B$2:$B$9=$A12 and $C$2:$C$9=B$11 each produce a TRUE/FALSE array; multiply them to get a 0/1 array with a single 1 at the match, then multiply by the ID column ($A$2:$A$9) to get the matching row number in your table.
SUMPRODUCT collapses that array into a single value (the row number).
Finally, INDEX retrieves the desired value from that row.
The IF test checks whether a match exists (SUMPRODUCT > 0), so that cells with no match are left blank.
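The same row-finding arithmetic can be sketched in R, as a rough mirror of the example data (an illustration only, not part of the spreadsheet solution):

```r
# Columns A (ID), B (Case), C (Chance#), D (Winner) from the example
case   <- c(1, 1, 1, 1, 2, 2, 2, 2)
chance <- c(7, 9, 10, 11, 1, 3, 4, 7)
winner <- c("Joe", "Emil", "Harry", "Kate", "Tom", "Jerry", "Mike", "John")
id     <- seq_along(case)

# The SUMPRODUCT step: 0/1 match mask multiplied by the ID column, summed
row <- sum((case == 2) * (chance == 3) * id)   # -> 6

# The INDEX step, guarded like the IF in the formula
if (row > 0) winner[row] else ""               # -> "Jerry"
```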

Convert Multiple Choice Data to Numeric

I have data that looks like this:
+-------------+------------+------------------+-------------------+------------------+
| gender | age | income | ate_string_cheese | tech_familiarity |
+-------------+------------+------------------+-------------------+------------------+
| A. Female | D. 45-54 | B. $50K - $80K | B. Once or twice | A. Low |
| A. Female | C. 35-44 | A. $35K - $49K | B. Once or twice | B. Medium |
| B. Male | B. 25-34 | B. 50k - 79,999 | B. Once or twice | C. High |
| A. Female | A. 18-24 | D. $100k - $149k | B. Once or twice | B. Medium |
+-------------+------------+------------------+-------------------+------------------+
I want to try to find correlations between different observations. I need the values to be numerical. I'm wondering if there's an easy way to do this in R?
To be clear, the result from the above would look like this:
+--------+-----+--------+-------------------+------------------+
| gender | age | income | ate_string_cheese | tech_familiarity |
+--------+-----+--------+-------------------+------------------+
| 1 | 4 | 2 | 2 | 1 |
| 1 | 3 | 1 | 2 | 2 |
| 2 | 2 | 2 | 2 | 3 |
| 1 | 1 | 4 | 2 | 2 |
+--------+-----+--------+-------------------+------------------+
I'm assuming there must be a package for this, but I can't find the Google incantation that will conjure it. Please know that I'm a complete statistics newbie who's just poking around, so if you prod me for more details, I likely won't have an educated answer to return.
To answer your question about converting categorical data into numerical data in R:
You can convert character data into a factor using as.factor():
factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character.
Pros:
This will encode your data numerically with an attribute that maps the character value for reference.
Factors can be ordered which can capture important information about ordinal data (such as age bands in your case)
Cons:
Beware of converting categorical data into numeric values for the purposes of statistical analysis. The numerical values are probably not on an interval or ratio scale for all questions, so taking things like the mean of, or difference between, levels may not make sense; e.g. consider whether the distance between levels is actually constant, whether there is a natural zero point, etc.
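As a minimal sketch of the factor coding, using the age column from the question:

```r
# factor() stores character data as integer codes plus a levels attribute
age <- c("D. 45-54", "C. 35-44", "B. 25-34", "A. 18-24")
f <- factor(age)        # levels sort alphabetically: "A. 18-24" < ... < "D. 45-54"
as.integer(f)           # -> 4 3 2 1
levels(f)               # the code-to-label mapping

# An ordered factor additionally captures the ordinal structure of the bands
o <- factor(age, ordered = TRUE)
o[1] > o[4]             # -> TRUE: "D. 45-54" ranks above "A. 18-24"
```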
You just need to extract the first character, convert it to lowercase, and map it to a number:
# Your original data frame
df <- read.table(text = "gender;age;income;ate_string_cheese;tech_familiarity
A. Female;D.45-54;B.$50K - $80K;B.Once or twice;A.Low
A. Female;C.35-44;A.$35K - $49K;B.Once or twice;B. Medium
B. Male;B.25-34;B.50k - 79,999;B.Once or twice;C. High
A. Female;A. 18-24;D.$100k - $149k;B.Once or twice;B. Medium", header = TRUE, sep = ";")
myLetters <- letters[1:26]
# Extract the leading letters, convert to lowercase, and map to alphabet position
sapply(df, function(x) match(tolower(gsub("([A-Za-z]+).*", "\\1", x)), myLetters))
Output:
gender age income ate_string_cheese tech_familiarity
[1,] 1 4 2 2 1
[2,] 1 3 1 2 2
[3,] 2 2 2 2 3
[4,] 1 1 4 2 2
You could trim the whitespace, grab just the leading A, B, C, D parts, and call factor on each column with levels=LETTERS[1:4] and labels=1:4.
structure(factor(sub('\\..*','',trimws(as.matrix(df))),labels=1:4),.Dim=dim(df),dimnames=dimnames(df))
gender age income ate_string_cheese tech_familiarity
1 1 4 2 2 1
2 1 3 1 2 2
3 2 2 2 2 3
4 1 1 4 2 2
This is a matrix; you can convert it to a data frame.
We can convert the columns to factors and coerce them to integers:
df[] <- lapply(df, function(x) as.integer(factor(x)))

R data.table add new column with query for each row

I have two data.tables in R, like so:
first_table
id | first | trunc | val1
=========================
1 | Bob | Smith | 10
2 | Sue | Goldm | 20
3 | Sue | Wollw | 30
4 | Bob | Bellb | 40
second_table
id | first | last | val2
==============================
1 | Bob | Smith | A
2 | Bob | Smith | B
3 | Sue | Goldman | A
4 | Sue | Goldman | B
5 | Sue | Wollworth | A
6 | Sue | Wollworth | B
7 | Bob | Bellbottom | A
8 | Bob | Bellbottom | B
As you can see, the last names in the first table are truncated. Also, the combination of first and last name is unique in the first table, but not in the second. I want to "join" on the combination of first name and last name under the incredibly naive assumptions that
first,last uniquely defines a person
that truncation of the last name does not introduce ambiguity.
The result should look like this:
id | first | trunc | last | val1
=======================================
1 | Bob | Smith | Smith | 10
2 | Sue | Goldm | Goldman | 20
3 | Sue | Wollw | Wollworth | 30
4 | Bob | Bellb | Bellbottom | 40
Basically, for each row in table_1, I need to find a row that back fills the last name.
For Each Row in first_table:
Find the first row in second_table with:
matching first_name & trunc is a substring of last
And then join on that row
Is there an easy vectorized way to accomplish this with data.table?
One approach is to join on first, then filter based on the substring match:
first_table[
  unique(second_table[, .(first, last)])
  , on = "first"
  , nomatch = 0
][
  substr(last, 1, nchar(trunc)) == trunc
]
# id first trunc val1 last
# 1: 1 Bob Smith 10 Smith
# 2: 2 Sue Goldm 20 Goldman
# 3: 3 Sue Wollw 30 Wollworth
# 4: 4 Bob Bellb 40 Bellbottom
Or, do the truncation on second_table to match the first (the truncated names here are 5 characters long), then join on both columns:
first_table[
  unique(second_table[, .(first, last, trunc = substr(last, 1, 5))])
  , on = c("first", "trunc")
  , nomatch = 0
]
## yields the same answer
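For reference, the example can be reproduced end-to-end as below (a sketch; column names and values follow the question):

```r
library(data.table)

# Reconstructing the two example tables from the question
first_table <- data.table(
  id    = 1:4,
  first = c("Bob", "Sue", "Sue", "Bob"),
  trunc = c("Smith", "Goldm", "Wollw", "Bellb"),
  val1  = c(10, 20, 30, 40)
)
second_table <- data.table(
  id    = 1:8,
  first = rep(c("Bob", "Sue", "Sue", "Bob"), each = 2),
  last  = rep(c("Smith", "Goldman", "Wollworth", "Bellbottom"), each = 2),
  val2  = rep(c("A", "B"), 4)
)

# First approach: join on `first`, keep rows where `trunc` is a prefix of `last`
res <- first_table[
  unique(second_table[, .(first, last)])
  , on = "first", nomatch = 0
][substr(last, 1, nchar(trunc)) == trunc]
```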

treating overlapping states TraMineR

I'm using TraMineR and I'm trying to import a dataset and convert it from SPELL format to STS format.
Here's an example of my dataset (for the sake of simplicity I used numeric values instead of dates).
Alphabet=[a,b]
days=[1,2,3,4,5....]
id | start | end | values |
1 | 1 | 5 | a |
1 | 6 | 12 | a |
1 | 10 | 15 | b |
2 | 2 | 8 | b |
2 | 7 | 10 | a |
Defining the sequences in STS format, I'll have the following
id day1 day2 .........day9 day10 day11 day12 day13 day14.......
1 a a ......... a a a a b b .......
2 ........and so on
The problem is that I have concomitant states: the next state starts before the previous one ends, as happens between the second and third spells for id 1.
How can I split the states?
I.e., when state a finishes, then b starts from its beginning, but only if the overlap is less than n days.
Or maybe I can define another state for the periods where a and b overlap for more than n days.
I.e.
id day1 day2 .........day9 day10 day11 day12 day13 day14.......
1 a a ......... a ab ab ab b b .......

Remove duplicates where values are swapped across 2 columns in R [duplicate]

This question already has answers here:
pair-wise duplicate removal from dataframe [duplicate]
(4 answers)
Closed 6 years ago.
I have a simple dataframe like this:
| id1 | id2 | location | comment |
|-----|-----|------------|-----------|
| 1 | 2 | Alaska | cold |
| 2 | 1 | Alaska | freezing! |
| 3 | 4 | California | nice |
| 4 | 5 | Kansas | boring |
| 9 | 10 | Alaska | cold |
The first two rows are duplicates because ids 1 and 2 both went to Alaska. It doesn't matter that their comments are different.
How can I remove one of these duplicates? Either one would be fine to remove.
I was first trying to sort id1 and id2, then get the index where they are duplicated, then go back and use the index to subset the original df. But I can't seem to pull this off.
df <- data.frame(id1 = c(1,2,3,4,9), id2 = c(2,1,4,5,10), location=c('Alaska', 'Alaska', 'California', 'Kansas', 'Alaska'), comment=c('cold', 'freezing!', 'nice', 'boring', 'cold'))
We can use apply with MARGIN=1 to sort the 'id' columns within each row, cbind the result with 'location', and then use duplicated to get a logical index for removing/keeping the rows.
df[!duplicated(data.frame(t(apply(df[1:2], 1, sort)), df$location)),]
# id1 id2 location comment
#1 1 2 Alaska cold
#3 3 4 California nice
#4 4 5 Kansas boring
#5 9 10 Alaska cold
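An alternative sketch with the same effect, avoiding the row-wise apply, builds the sorted id pair with pmin()/pmax():

```r
df <- data.frame(id1 = c(1, 2, 3, 4, 9), id2 = c(2, 1, 4, 5, 10),
                 location = c("Alaska", "Alaska", "California", "Kansas", "Alaska"),
                 comment = c("cold", "freezing!", "nice", "boring", "cold"))

# pmin/pmax put each unordered id pair into canonical order, vectorised over rows
key <- data.frame(lo = pmin(df$id1, df$id2), hi = pmax(df$id1, df$id2),
                  location = df$location)
df[!duplicated(key), ]   # drops the second Alaska row for the (1, 2) pair
```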

How to sum certain elements of a column in a matrix in R?

     [,1] [,2] [,3]
[1,]    3   20    3
[2,]    2   10    3
[3,]    3   25    3
[4,]    1   15    3
[5,]    3   30    3
Can you help me get the sum of the second column, but only over the elements that have 3 in the first column? For example, in this matrix it is 20+25+30=75. In the fastest way possible (it's actually a big matrix).
P.S. I tried something like this: with(Train, sum(Column2[,"Date"] == i))
As you can see, I need the sum of Column2 where the date has a certain value (from 1 to 12).
We can create a logical index with the first column and use that to subset the second column and get the sum
sum(m1[m1[,1]==3,2])
EDIT: Based on #Richard Scriven's comment.
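Reproducing the example to check: the rows with 3 in the first column contribute 20 + 25 + 30.

```r
m1 <- matrix(c(3, 2, 3, 1, 3,        # column 1
               20, 10, 25, 15, 30,   # column 2
               3, 3, 3, 3, 3),       # column 3
             ncol = 3)
sum(m1[m1[, 1] == 3, 2])   # -> 75
```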
