How to conditionally replace values in a data table in R

I have the data table below and want to replace every Usage value other than Web and Mobile with another category, say "Others". (OR)
Is there any way to group the usage as Web, Mobile and everything else as Others without replacing the values, e.g. to get counts like Web 1, Mobile 1, Others 4? (OR)
Do we need to write a function to do so?
Id Name Usage
1 AAA Web
2 BBB Mobile
3 CCC Manual
4 DDD M1
5 EEE M2
6 FFF M3

Assuming that 'Usage' is of character class, we can use %chin% to create a logical index for Web/Mobile, negate it (!) to select the remaining rows, and assign (:=) "Others" to 'Usage' in those rows. This is efficient because we assign in place without any copying.
library(data.table)
setDT(df1)[!Usage %chin% c("Web", "Mobile"), Usage := "Others"]
df1
#    Id Name  Usage
#1: 1 AAA Web
#2: 2 BBB Mobile
#3: 3 CCC Others
#4: 4 DDD Others
#5: 5 EEE Others
#6: 6 FFF Others
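If you only need the counts and would rather not overwrite the column, one option (a sketch, assuming the original, unmodified df1) is to build the grouping on the fly with fifelse() and count with .N:
library(data.table)
# Count Web, Mobile and everything else as "Others" without changing df1
setDT(df1)[, .N, by = .(Usage = fifelse(Usage %chin% c("Web", "Mobile"), Usage, "Others"))]
#     Usage N
# 1:    Web 1
# 2: Mobile 1
# 3: Others 4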


I want to match a column from one table with another table and replace those matching values [duplicate]

I have a table, say Table A:
uid pid code
1 1 aaa
2 1 ccc
3 4 ddd
4 2 eee
I have another table, Table B:
pid msg
1 good
2 inspiring
3 thing to wtch
4 terrible
Now I want to replace pid in Table A with msg from Table B.
I used merge(tableA, tableb, by = c("pid"))
and got the result:
uid pid code msg
1 1 aaa good
2 1 ccc good
3 4 ddd terrible
4 2 eee inspiring
whereas I want the result to be:
uid msg code
1 good aaa
2 good ccc
3 terrible ddd
4 inspiring eee
Your approach seems absolutely correct; it just needs two further steps:
selecting the required columns
reordering them
With tidyverse functions, you can do something like:
TableA %>%
  left_join(TableB) %>%
  select(uid, msg, code)
which gives:
uid msg code
1 1 good aaa
2 2 good ccc
3 3 terrible ddd
4 4 inspiring eee
Base R solution:
newtable = merge(tableA, tableB, by = "pid")
newtable$pid = NULL                                    # drop the pid column
newtable = newtable[order(newtable$uid), c(1, 3, 2)]   # sort by uid, reorder to uid, msg, code
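If you'd rather skip the merge-and-reorder step entirely, another base R option is a lookup with match(); a sketch, assuming the same tableA and tableB as above:
# Look up each pid of tableA in tableB and pull the corresponding msg,
# keeping tableA's original row order
tableA$msg <- tableB$msg[match(tableA$pid, tableB$pid)]
tableA[, c("uid", "msg", "code")]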

Get frequency counts for a subset of elements in a column

I may be missing some elegant way in Stata to handle this example, which has to do with electrical parts and observed monthly failures, etc.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
I would like to group by (bysort) each PartID and record the FailType with the highest frequency within each PartID. Ties can be broken arbitrarily, though preferably the lower FailType is picked.
I looked at groups etc., but do not know how to peel off certain elements from the result set, and that is a major question for me. If you execute a query, how do you select only the elements you want for the next computation? Something like n(0) is the count, n(1) is the mean, etc. I was able to use contract, bysort etc. to create a separate data set which I then merged back into the main set as an extra column, but there must be something simpler using gen or egen so that there is no need to create an extra data set.
The expected results here will be:
PartID Freq
ABD 4 #(4 occurs twice)
ABC 2 #(tie broken with minimum)
BBB 0 #(0 occurs 3 times)
Please let me know how I can pick off specific elements that I need from a result set (can be from duplicate reports, tab etc.)
Part II - Clarification: Perhaps I should have clarified and split the question into two parts. For example, if I issue this follow-up command after running your code: tabdisp Type, c(Freq), it prints out a nice table. Can I then use that (derived) table to perform more computations programmatically?
For example, get the first row of the table:
----------------------
      Type |      Freq
-----------+----------
         A |        -1
         B |        -1
         C |        -1
         D |        -3
         S |        -3
----------------------
I found this difficult to follow (see comment on question), but some technique is demonstrated here. The numbers of observations in subsets of observations defined by by: are given by _N. The rest is sorting tricks. Negating the frequency is a way to select the highest frequency and the lowest Type, which I think is what you are after when splitting ties. Negating back gets you the positive frequencies.
clear
input str3 (PartID Type FailType)
ABD A 4
BBB S 0
ABD A 3
ABD A 4
ABC A 2
BBB A 0
ABD B 1
ABC B 7
BBB C 1
BBB D 0
end
bysort PartID FailType: gen Freq = -_N
bysort PartID (Freq Type) : gen ToShow = _n == 1
replace Freq = -Freq
list PartID Type FailType Freq if ToShow
+---------------------------------+
| PartID Type FailType Freq |
|---------------------------------|
1. | ABC A 2 1 |
3. | ABD A 4 2 |
7. | BBB A 0 3 |
+---------------------------------+

Searching a vector/data table backwards in R

Basically, I have a very large data frame/data table and I would like to search a column backwards from my current index position for the first (i.e. closest) NA value at a lower index.
For example, let's say I have a data frame DF as follows:
INDEX | KEY | ITEM
----------------------
1 | 10 | AAA
2 | 12 | AAA
3 | NA | AAA
4 | 18 | AAA
5 | NA | AAA
6 | 24 | AAA
7 | 29 | AAA
8 | 31 | AAA
9 | 34 | AAA
In this data frame we have NA values at index 3 and index 5. Now, let's say we start at index 8 (which has a KEY of 31). I would like to search the column KEY backwards so that the search stops at the first instance of NA it finds, and the index of that NA value is returned.
I know there are ways to find all NA values in a vector/column (for example, I can use which(is.na(x)) to return the index values that are NA), but due to the sheer size of the data frame I am working with, and the large number of iterations that need to be performed, this is a very inefficient way of doing it. One method I thought of is a kind of "do while" loop, and it does seem to work, but it again seems quite inefficient since it repeats the calculation on every lookup (and given that I need to do over 100,000 iterations this does not look like a good idea).
Is there a fast way of searching a column backwards from a particular index such that I can find the index of the closest NA value?
Why not do a forward-fill of the NA indexes once, so that you can then look up the most recent NA for any row in future:
library(dplyr)
library(tidyr)
df = df %>%
  mutate(last_missing = if_else(is.na(KEY), INDEX, as.integer(NA))) %>%
  fill(last_missing)
Output:
> df
INDEX KEY ITEM last_missing
1 1 10 AAA NA
2 2 12 AAA NA
3 3 NA AAA 3
4 4 18 AAA 3
5 5 NA AAA 5
6 6 24 AAA 5
7 7 29 AAA 5
8 8 31 AAA 5
9 9 34 AAA 5
Now there's no need to recalculate every time you need the answer for a given row. There may be more efficient ways to do the forward fill, but I think exploring those is easier than figuring out how to optimise the backward search.
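A base R take on the same idea is to compute the NA positions once and then look them up with findInterval(), which does a binary search instead of rescanning the column on every query; a rough sketch using the df above (last_na is just an illustrative helper name):
na_idx <- which(is.na(df$KEY))       # positions of the NA values, computed once
last_na <- function(i) {
  pos <- findInterval(i, na_idx)     # closest NA position at or before row i
  if (pos == 0) NA_integer_ else na_idx[pos]
}
last_na(8)                           # returns 5, the nearest NA at or before index 8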

sqlite: select/display a few rows around matching rows

I have a table of the form
select rowid,* from t;
rowid f
----- -----
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
7 ggg
8 aaa
9 bbb
10 ccc
11 ddd
12 eee
13 fff
14 ggg
I'd like to select n rows before and m rows after a given matching row, e.g. for rows that match f='ccc' with n=m=1 I'd like to get:
2 bbb
3 ccc
4 ddd
9 bbb
10 ccc
11 ddd
The rowid is sequential in my setup, so I guess we can play with it. I tried things along the lines of:
select rowid,f from t where rowid between
(select rowid-1 from t where f='ccc') and
(select rowid+1 from t where f='ccc');
rowid f
----- -----
2 bbb
3 ccc
4 ddd
But the result is obviously wrong: I got only the first occurrence of the 'ccc' match. I guess I have to use a join or maybe a recursive CTE, but I am afraid that is beyond my knowledge so far :) Thanks in advance.
A scalar subquery can return only a single value.
You could do two self joins, but it would be simpler to use set operations:
SELECT * FROM t
WHERE rowid IN (SELECT rowid - 1 FROM t WHERE f = 'ccc'
                UNION ALL
                SELECT rowid FROM t WHERE f = 'ccc'
                UNION ALL
                SELECT rowid + 1 FROM t WHERE f = 'ccc');
Larger values of n and m require more subqueries.
If there are too many, you can use a join:
SELECT *
FROM t
WHERE rowid IN (SELECT t.rowid
                FROM t
                JOIN (SELECT rowid - ? AS n,
                             rowid + ? AS m
                      FROM t
                      WHERE f = 'ccc'
                     ) AS ranges
                ON t.rowid BETWEEN ranges.n AND ranges.m);
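If you happen to be running this from R rather than the sqlite3 shell, the two ? placeholders can be bound at query time with DBI (a sketch, assuming an existing RSQLite connection con to the database containing t):
library(DBI)
sql <- "SELECT *
        FROM t
        WHERE rowid IN (SELECT t.rowid
                        FROM t
                        JOIN (SELECT rowid - ? AS n, rowid + ? AS m
                              FROM t
                              WHERE f = 'ccc') AS ranges
                        ON t.rowid BETWEEN ranges.n AND ranges.m)"
# Bind n = 1 (rows before each match) and m = 1 (rows after each match)
dbGetQuery(con, sql, params = list(1L, 1L))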
I came up with a solution that I think is not optimal, but I am not able to simplify away (remove) the temp (intermediate) table.
select rowid,f from t;
rowid f
----- -----
1 aaa
2 bbb
3 ccc
4 ddd
5 eee
6 fff
7 ggg
8 aaa
9 bbb
10 ccc
11 ddd
12 eee
13 fff
14 ggg
create table u as
select t2.rowid x, t1.rowid+2 y     -- +2 ==> 2 rows after 'ccc'
from t t1
join t t2 on t1.rowid = t2.rowid+1  -- +1 ==> 1 row before 'ccc'
where t1.f='ccc';
select * from u;
x y
----- -----
2 5
9 12
select t.rowid, t.f from t
inner join u on t.rowid >= u.x and t.rowid <= u.y;
rowid f
----- -----
2 bbb 1 before
3 ccc <== match
4 ddd 2 after
5 eee
9 bbb 1 before
10 ccc <== match
11 ddd 2 after
12 eee
I think I am set with what I need, but optimisations are welcome :)
I might be overlooking something, but the provided solutions that suggest adding/subtracting from values of the rowid column could be improved upon. They will face issues should rowid ever be missing a value (which I'm aware was stated never to be the case in the top post, but in general that is an assumption that's often not true).
By using sqlite's row_number() you can have a solution that circumvents that problem and also can be used to fetch the entries "around" your row matches based on any arbitrary order you want, not just based on rowid.
Together with Common Table Expressions this can even be made somewhat readable, though should you have a larger amount of row-matches this will still be a slow query.
What you'll conceptually be doing is:
1) Do a pre-select on your table (here cte_t) to get all possible values that could be a valid hit and attach a row number to each entry.
2) Do a select on that pre-select to fetch the specific rows that you actually want and keep only their row numbers (here targetRows).
3) "Join" the two by pretty much just multiplying the two tables generated in 1) and 2).
4) Now you can easily select all entries whose row number is in a specific range around a target's row number using ABS.
WITH
cte_t AS (
    -- Select rowid explicitly (a plain * would not carry it into the CTE)
    -- and attach a row number to each entry
    SELECT t.rowid AS rowid, t.*, row_number() OVER (ORDER BY t.rowid) AS rownum
    FROM t
    -- If you can make this cte smaller by removing all entries that can't possibly be the
    -- solution with an appropriate WHERE clause, the entire query gets substantially faster
),
targetRows AS (
    SELECT rownum AS targetRowNum
    FROM cte_t
    WHERE f = 'ccc' -- The WHERE condition that defines the entries that match your query
                    -- exactly and for which you want to get the entries around them
)
SELECT cte_t.rowid, cte_t.f
FROM cte_t, targetRows -- Basically multiplying both tables with one another;
                       -- this part will be horribly slow if targetRows gets larger
WHERE ABS(cte_t.rownum - targetRows.targetRowNum) <= 1; -- Get all entries in targetRows as well as
                                                        -- those whose rownum is 1 larger or 1 lower
This will return
rowid f
2 bbb
3 ccc
4 ddd
9 bbb
10 ccc
11 ddd
Here is a good resource about this.

Removing time series with only zero values from a data frame

I have a data frame with multiple time series identified by unique ids. I would like to remove any time series that contains only 0 values.
The data frame looks as follows:
id date value
AAA 2010/01/01 9
AAA 2010/01/02 10
AAA 2010/01/03 8
AAA 2010/01/04 4
AAA 2010/01/05 12
B 2010/01/01 0
B 2010/01/02 0
B 2010/01/03 0
B 2010/01/04 0
B 2010/01/05 0
CCC 2010/01/01 45
CCC 2010/01/02 46
CCC 2010/01/03 0
CCC 2010/01/04 0
CCC 2010/01/05 40
I want any time series with only 0 values to be removed, so that the data frame looks as follows:
id date value
AAA 2010/01/01 9
AAA 2010/01/02 10
AAA 2010/01/03 8
AAA 2010/01/04 4
AAA 2010/01/05 12
CCC 2010/01/01 45
CCC 2010/01/02 46
CCC 2010/01/03 0
CCC 2010/01/04 0
CCC 2010/01/05 40
This is a follow-up to a previous question that was answered with a really great solution using the data.table package:
R efficiently removing missing values from the start and end of multiple time series in 1 data frame
If dat is a data.table, then this is easy to write and read:
dat[,.SD[any(value!=0)],by=id]
.SD stands for Subset of Data. This answer explains .SD very well.
Picking up on Gabor's nice use of ave, but without repeating the same variable name (DF) three times, which can be a source of typo bugs if you have a lot of long or similar variable names, try:
dat[ ave(value!=0,id,FUN=any) ]
The difference in speed between those two may depend on several factors, including: i) the number of groups, ii) the size of each group and iii) the number of columns in the real dat.
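If you want to check which of the two is faster on your own data, a quick sketch with the microbenchmark package (assuming dat is the data.table form of the data):
library(data.table)
library(microbenchmark)
microbenchmark(
  sd_way  = dat[, .SD[any(value != 0)], by = id],
  ave_way = dat[ave(value != 0, id, FUN = any)],
  times = 100
)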
Try this. No packages are used.
DF[ ave(DF$value != 0, DF$id, FUN = any), ]
An easy plyr solution would be
ddply(mydat,"id",function(x) if (all(x$value==0)) NULL else x)
(seems to work OK) but there may be a faster solution with data.table ...
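For completeness, the same grouped filter reads naturally in dplyr as well (a sketch, assuming the data frame is mydat as in the plyr example):
library(dplyr)
mydat %>%
  group_by(id) %>%
  filter(any(value != 0)) %>%   # keep a group only if it has at least one non-zero value
  ungroup()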
