     [,1] [,2] [,3]
[1,]    3   20    3
[2,]    2   10    3
[3,]    3   25    3
[4,]    1   15    3
[5,]    3   30    3
Can you help me get the sum of the second column, but only for the elements that have 3 in the first column? For example, in this matrix it is 20+25+30=75. And in the fastest way possible (it's actually a big matrix).
P.S. I tried something like this: with(Train, sum(Column2[,"Date"] == i))
As you can see, I need the sum of Column2 where Date has a certain value (from 1 to 12).
We can create a logical index from the first column and use it to subset the second column, then take the sum:
sum(m1[m1[,1]==3,2])
EDIT: Based on @Richard Scriven's comment.
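A minimal reproducible sketch (assuming the data is stored in a numeric matrix named m1, as in the answer):
# Build the example matrix from the question
m1 <- cbind(c(3, 2, 3, 1, 3),
            c(20, 10, 25, 15, 30),
            c(3, 3, 3, 3, 3))
# The logical index on column 1 selects the matching rows of column 2
sum(m1[m1[, 1] == 3, 2])
#> [1] 75
# For the data frame in the P.S. (hypothetical names from the question):
# sum(Train[Train$Date == i, "Column2"])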
I have 3 datasets with varying rows and columns. In the end result, all rows should be present, and all columns should be present, including the non-overlapping (unique) ones.
a <- data.frame(a=c(0,1,2), b=c(3,4,5), c=c(6,7,8))
b <- data.frame(a=c(9,10,11), c=c(12,13,14), d=c(15,16,17))
The result needs to be:
c <- data.frame(a=c(0,1,2,9,10,11), b=c(3,4,5,NA,NA,NA), c=c(6,7,8,12,13,14), d=c(NA,NA,NA,15,16,17))
But imagine that instead of having a, b, c, d, you have the whole alphabet 4 times (edit: and you don't know in advance which column names overlap, such as the a columns that a and b share).
The default behavior of all the dplyr join commands is to join on all columns that both datasets have in common. Since you want to keep all rows, even when there is nothing to join on, you will want an outer join, which dplyr calls full_join.
Probably something like the following:
output = input_df1 %>%
  dplyr::full_join(input_df2) %>%
  dplyr::full_join(input_df3)
This will begin by joining the first two dataframes on all columns they have in common, and will then join the third dataframe on all columns it has in common with that result.
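For the a and b frames from the question, a quick sketch (the join-by message is omitted, and row order may vary by dplyr version):
library(dplyr)
a <- data.frame(a = c(0, 1, 2), b = c(3, 4, 5), c = c(6, 7, 8))
b <- data.frame(a = c(9, 10, 11), c = c(12, 13, 14), d = c(15, 16, 17))
# Joins on the shared columns a and c; unmatched rows are kept,
# and columns absent on one side are filled with NA
full_join(a, b)
#>    a  b  c  d
#> 1  0  3  6 NA
#> 2  1  4  7 NA
#> 3  2  5  8 NA
#> 4  9 NA 12 15
#> 5 10 NA 13 16
#> 6 11 NA 14 17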
This assumes that wherever dataframes have the same columns they have the same entries. Consider the following example:
df1:
a | b | z
---+---+---
1 | NA| 7
df2:
a | b | y
---+---+---
1 | 2 | 8
df3:
a | b | x
---+---+---
1 | 2 | 9
This will not produce
output:
a | b | z | y | x
---+---+---+---+---
1 | 2 | 7 | 8 | 9
This is because the first table's b value (NA) does not match the b in the other tables, even though the a values match, so its row cannot merge with theirs. Instead this will produce:
output:
a | b | z | y | x
---+---+---+---+---
1 | 2 | NA| 8 | 9
1 | NA| 7 | NA| NA
If you need to handle this type of case, you will have to put more effort into your joins. Perhaps start with:
colnames1 = colnames(input_df1)
colnames2 = colnames(input_df2)
common_colnames = colnames1[colnames1 %in% colnames2]
To get all common column names and decide from there how to join.
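For example, if only one of the shared names is a true key, you could name it explicitly and keep the other shared columns separate (a sketch; input_df1/input_df2 and the key "a" are placeholder names, not from the question):
# Hypothetical: treat only "a" as the join key; other shared columns
# are kept side by side and disambiguated with suffixes
output = input_df1 %>%
  dplyr::full_join(input_df2, by = "a", suffix = c("_1", "_2"))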
I have a simple database table with three columns: id, x, y. x and y are just the coordinates of points on a line. I want to use SQLite window functions to partition the table using a sliding window of three rows, and then get the y value that is furthest from the y value of the first row in the window.
An example:
| id | x | y |
|----|---|---|
| 1 | 1 | .5|
| 2 | 2 | .9|
| 3 | 3 | .7|
| 4 | 4 |1.1|
| 5 | 5 | 1 |
So the first partition would consist of:
| id | x | y |
|----|---|---|
| 1 | 1 | .5|
| 2 | 2 | .9|
| 3 | 3 | .7|
And the desired result would be:
| id | x | y | d |
|----|---|---|---|
| 1  | 1 | .5| .4|
| 2  | 2 | .9|   |
| 3  | 3 | .7|   |
The window with id = 1 as the current row has a maximum variation of .4: the maximum distance between the y value of the first row in the partition (.5) and the other y values is |.5 - .9| = .4.
The final expected result:
| id | x | y | d |
|----|---|---|---|
| 1 | 1 | .5| .4|
| 2 | 2 | .9| .2|
| 3 | 3 | .7| .4|
| 4 | 4 |1.1| .1|
| 5 | 5 | 1 | |
I've tried using a window definition like WINDOW win1 AS (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING), which gives me the correct window.
With the window defined, I tried doing something like:
SELECT
max(abs(y - first_value(y) OVER win1)) AS d
FROM t
WINDOW win1 AS (ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
But I get an error for misuse of first_value().
I think the problem is that this is not the proper approach to calculating over each row of a partition, but I could not find another solution or approach that matches what I am trying to do here.
For each row of your table you define a window starting from the current row up to the next 2 rows.
In your code, y is the value in the current row, and first_value() is the 1st value of y in the current window, which is also the value of y in the current row.
So even if your code were syntactically correct, the difference you calculate would always be 0.
It's easier to solve your problem with the LEAD() window function:
WITH cte AS (
  SELECT *,
         -- the next two y values, in id order (an explicit ORDER BY
         -- keeps the window deterministic)
         LEAD(y, 1) OVER (ORDER BY id) AS y1,
         LEAD(y, 2) OVER (ORDER BY id) AS y2
  FROM tablename
)
SELECT id, x, y,
       -- COALESCE handles the next-to-last row, where only y2 is NULL;
       -- in the last row both are NULL, so d is NULL
       MAX(ABS(y - y1), COALESCE(ABS(y - y2), 0)) AS d
FROM cte;
Results:
id x y d
1 1 0.5 0.4
2 2 0.9 0.2
3 3 0.7 0.4
4 4 1.1 0.1
5 5 1.0
I have data that looks like this:
+-------------+------------+------------------+-------------------+------------------+
| gender | age | income | ate_string_cheese | tech_familiarity |
+-------------+------------+------------------+-------------------+------------------+
| A. Female | D. 45-54 | B. $50K - $80K | B. Once or twice | A. Low |
| A. Female | C. 35-44 | A. $35K - $49K | B. Once or twice | B. Medium |
| B. Male | B. 25-34 | B. 50k - 79,999 | B. Once or twice | C. High |
| A. Female | A. 18-24 | D. $100k - $149k | B. Once or twice | B. Medium |
+-------------+------------+------------------+-------------------+------------------+
I want to try to find correlations between different observations. I need the values to be numerical. I'm wondering if there's an easy way to do this in R?
To be clear, the result from the above would look like this:
+--------+-----+--------+-------------------+------------------+
| gender | age | income | ate_string_cheese | tech_familiarity |
+--------+-----+--------+-------------------+------------------+
| 1 | 4 | 2 | 2 | 1 |
| 1 | 3 | 1 | 2 | 2 |
| 2 | 2 | 2 | 2 | 3 |
| 1 | 1 | 4 | 2 | 2 |
+--------+-----+--------+-------------------+------------------+
I'm assuming there must be a package for this, but I can't find the Google incantation that will conjure it. Please know that I'm a complete statistics newbie who's just poking around, so if you prod me for more details, I likely won't have an educated answer to return.
To answer your question about converting categorical data into numerical data in R:
You can convert character data into a factor using as.factor(). From the R documentation:
factor returns an object of class "factor" which has a set of integer codes the length of x with a "levels" attribute of mode character.
Pros:
This will encode your data numerically with an attribute that maps the character value for reference.
Factors can be ordered, which can capture important information about ordinal data (such as the age bands in your case); see the sketch after this list.
Cons:
Beware of converting categorical data into numeric for the purposes of performing statistical analysis on it. The numerical values are probably not on an interval or ratio scale for all questions, so taking things like the mean of, or difference between, levels may not make sense: e.g. consider whether the distance between each level is actually constant, whether there is a natural zero point, etc.
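A small sketch of the ordered-factor point, using the age bands from the question:
# Explicit, ordered levels preserve the ordinal structure of the bands
age <- factor(c("D. 45-54", "C. 35-44", "B. 25-34", "A. 18-24"),
              levels = c("A. 18-24", "B. 25-34", "C. 35-44", "D. 45-54"),
              ordered = TRUE)
as.integer(age)
#> [1] 4 3 2 1
age[1] > age[4]  # comparisons are meaningful for ordered factors
#> [1] TRUE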
You just need to extract the leading letter, convert it to lowercase, and map it to a number:
# Your original data frame
df=read.table(text="gender;age;income;ate_string_cheese;tech_familiarity
A. Female;D.45-54;B.$50K - $80K;B.Once or twice;A.Low
A. Female;C.35-44;A.$35K - $49K;B.Once or twice;B. Medium
B. Male;B.25-34;B.50k - 79,999;B.Once or twice;C. High
A. Female;A. 18-24;D.$100k - $149k;B.Once or twice;B. Medium",header=T,sep=";")
myLetters <- letters[1:26]
# Extract the leading letter from each column, lowercase it, and
# match it against the alphabet to get a number
sapply(df, function(x) match(tolower(gsub("([A-Za-z]+).*", "\\1", x)), myLetters))
Output:
gender age income ate_string_cheese tech_familiarity
[1,] 1 4 2 2 1
[2,] 1 3 1 2 2
[3,] 2 2 2 2 3
[4,] 1 1 4 2 2
You could trim the whitespace, grab just the A/B/C/D parts, and call factor on the result with labels=1:4.
structure(factor(sub('\\..*', '', trimws(as.matrix(df))), labels = 1:4), .Dim = dim(df), dimnames = dimnames(df))
gender age income ate_string_cheese tech_familiarity
1 1 4 2 2 1
2 1 3 1 2 2
3 2 2 2 2 3
4 1 1 4 2 2
This is a matrix; you can convert it to a data frame if needed.
We can convert the columns to factors and coerce them to integers (factor levels sort alphabetically, so the A/B/C/D prefixes map to 1/2/3/4 here):
df[] <- lapply(df, function(x) as.integer(factor(x)))
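With the columns integer-coded as above, the correlation step the question asks about could look like the following (a sketch; keep in mind the earlier caveats about treating ordinal codes as numeric):
# Spearman is rank-based, which is usually safer for ordinal codes;
# zero-variance columns (ate_string_cheese here) come back as NA
cor(df, method = "spearman")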
First of all, my English is basic. Sorry.
Second, and most important: I can't find the way to do a simple query. My table is like this:
--------------------------------------------
id_det_iti | id_iti | orden_iti | id_ciudad
--------------------------------------------
1 | 1 | 1 | 374
2 | 1 | 2 | 25
3 | 1 | 3 | 241
4 | 2 | 1 | 34
5 | 2 | 2 | 22
6 | 2 | 3 | 352
7 | 2 | 4 | 17
--------------------------------------------
Then I want to get results like this:
------------------------------------------
id_iti | min | id_ciudad | max | id_ciudad
------------------------------------------
1 | 1 | 374 | 3 | 241
2 | 1 | 34 | 4 | 17
------------------------------------------
I need to show the max and the min value in the same row, grouped by id_iti.
I have tried to use a full join, but I'm working with SQLite, so that's not an option. I spent a long day trying different options but I couldn't find the solution. I hope you guys can help me.
Thanks in advance!
Edit:
SELECT a.id_iti, c.id_ciudad, d.id_ciudad
FROM detalle_itinerario as a,
(SELECT MAX(orden_iti),id_ciudad, id_iti FROM detalle_itinerario) AS c
INNER JOIN
(SELECT MIN(orden_iti),id_ciudad, id_iti FROM detalle_itinerario) AS d
ON c.id_iti=d.id_iti
GROUP BY a.id_iti;
That's only one of my attempts, but I just get the values of the first match.
First, use a simple query to get the min/max values for each group:
SELECT id_iti,
       MIN(orden_iti) AS min,
       MAX(orden_iti) AS max
FROM detalle_itinerario
GROUP BY id_iti;
You can then use these values to join back to the original table:
SELECT a.id_iti,
a.min,
a2.id_ciudad,
a.max,
a3.id_ciudad
FROM (SELECT id_iti,
MIN(orden_iti) AS min,
MAX(orden_iti) AS max
FROM detalle_itinerario
GROUP BY id_iti) AS a
JOIN detalle_itinerario AS a2 ON a.id_iti = a2.id_iti AND a.min = a2.orden_iti
JOIN detalle_itinerario AS a3 ON a.id_iti = a3.id_iti AND a.max = a3.orden_iti;
I am using the DGET function in LibreOffice. I have the first table as shown below (top), and I want to produce the second table (bottom). I can use the DGET function, where Database is the cell range containing the top table and Database Field is "Winner".
Is it possible to have different cell ranges in Search Criteria, so that each cell in the row for Case #1 can have a separate formula with a different search criterion, as given in the first row of the bottom table?
If I have to use separate contiguous cell ranges for the search criteria, then there would be n*Chances cell ranges, where n = the total number of cases (~150 in my case) and Chances = the number of possible Chance# values (50 in my case).
Case | Chance# | Winner
-------------------------
1 | 7 | Joe
1 | 9 | Emil
1 | 10 | Harry
1 | 11 | Kate
2 | 1 | Tom
2 | 3 | Jerry
2 | 4 | Mike
2 | 7 | John
Case |Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|Chance#|
|="=1" |="=2" |="=3" |="=4" |="=5" |="=6" |="=7" |="=8" |="=9" |="=10" |="=11" | ---- |="=50"
1 | | | | | | | Joe | |Emil |Harry | Kate | ---- |
2 | Tom | |Jerry |Mike | | | John | | | | | ---- |
To do so, you need to change your approach: instead of using DGET, I'm using a rather more complex method.
Considering your example:
A B C D
1 # Case Chance# Winner
2 1 1 7 Joe
3 2 1 9 Emil
4 3 1 10 Harry
5 4 1 11 Kate
6 5 2 1 Tom
7 6 2 3 Jerry
8 7 2 4 Mike
9 8 2 7 John
10
11 Case\Chance#    1      2      3      4
12      1
13      2         Tom           Jerry  Mike
I use the following:
=IF(SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))> 0,INDEX($D$2:$D$9,SUMPRODUCT(($B$2:$B$9=$A12)*($C$2:$C$9=B$11)*($A$2:$A$9))),"")
Let's ignore the IF and focus on the real deal here:
First, get the row that matches your condition: $B$2:$B$9=$A12 and $C$2:$C$9=B$11 each result in a TRUE/FALSE array; multiply them to get a 0/1 array with a single 1 at the match, then multiply by the # column to get the matching row number in your table.
SUMPRODUCT will get you a single value (the row) from the result array.
Finally, use INDEX to retrieve the desired value.
The IF statement tests whether a match exists (SUMPRODUCT > 0), to filter out the cells with no match.