Function for certain values in rows - r

I have panel data that looks like this (only the part relevant to my question is shown):
Persno 122 122 122 333 333 333 333 333 444 444
Income 1500 1500 2000 2000 2100 2500 2500 1500 2000 2200
year 1 2 3 1 2 3 4 5 1 2
I need a command or function that recognizes each distinct person. For all rows belonging to the same person, I would like to output the average income.
Thank you very much.

My favorite tool to solve problems like this is ddply, in the plyr package.
library(plyr)
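# example data: three persons observed over three years
p = data.frame(year=rep(c(1,2,3),3), persno = c(1,1,1,2,2,2,3,3,3), income=c(1500,1500,2000,2000,2100,2500,2500,1500,2000))
# split by persno and compute the mean income within each group
ddply(p, .(persno), summarize, mean.income = mean(income))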
which gives us the output
persno mean.income
1 1 1666.667
2 2 2200.000
3 3 2000.000
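For comparison, the same per-person average can be sketched in base R on the data from the question. This is only an illustration and not part of the answer above: the object name panel and the column mean_income are made up here, and the values are typed in from the question.

```r
# Panel data from the question (object name `panel` is an assumption)
panel <- data.frame(
  Persno = c(122, 122, 122, 333, 333, 333, 333, 333, 444, 444),
  Income = c(1500, 1500, 2000, 2000, 2100, 2500, 2500, 1500, 2000, 2200),
  year   = c(1, 2, 3, 1, 2, 3, 4, 5, 1, 2)
)

# One row per person with the mean income
avg <- aggregate(Income ~ Persno, data = panel, FUN = mean)

# Alternatively, attach the person's mean to every row
# (ave() uses mean when no FUN is given)
panel$mean_income <- ave(panel$Income, panel$Persno)
```

aggregate() collapses the data to one row per Persno, while ave() keeps the original number of rows, which is often handier for panel data.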

Related

Computing similarity between duplicated variables using unique identifier

I have a data set that looks like this, where id is supposed to be the unique identifier. There are duplicates, for example rows 1 and 3, but rows 1 and 6 or 3 and 6 are not duplicates because the year differs. The variable dupfreq gives the number of rows in the dataset, including that row itself, that share the same id and year.
  id   year tlabor   rev dupfreq
1 1419 2005      5  1072       2
2 1425 2005     42  2945       1
3 1419 2005      4   950       2
4 1443 2006     18  3900       1
5 1485 2006    118 35034       1
6 1419 2006      6  1851       1
I want to measure row similarity (in tlabor and rev) among the rows with dupfreq > 1, grouped by id and year.
I was thinking of something similar to this:
  id   year sim
1 1419 2005 0.83
Note that dupfreq can be >2, but if I can only generate the new table using rows with dupfreq==2 I am ok with it too.
Any advice is greatly appreciated! Thanks in advance!
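One possible starting point, sketched in base R. The similarity measure used here (the min/max ratio of each variable within a group, averaged over tlabor and rev) is purely an assumption, since the question does not say how the 0.83 was computed; the object names d, dups and sim are also made up.

```r
# Data from the question (object name `d` is an assumption)
d <- data.frame(
  id      = c(1419, 1425, 1419, 1443, 1485, 1419),
  year    = c(2005, 2005, 2005, 2006, 2006, 2006),
  tlabor  = c(5, 42, 4, 18, 118, 6),
  rev     = c(1072, 2945, 950, 3900, 35034, 1851),
  dupfreq = c(2, 1, 2, 1, 1, 1)
)

# Keep only rows that have at least one duplicate
dups <- d[d$dupfreq > 1, ]

# Assumed similarity for one variable: smallest value divided by largest
ratio <- function(x) min(x) / max(x)

# One row per (id, year) with the ratio for each variable
sim <- aggregate(cbind(tlabor, rev) ~ id + year, data = dups, FUN = ratio)

# Average the two ratios into a single score
sim$sim <- rowMeans(sim[, c("tlabor", "rev")])
```

Because min() and max() work on groups of any size, the same code also runs when dupfreq is greater than 2; whether that is a meaningful similarity for larger groups depends on the research question.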

Generating combinations based on 2 columns in R

I know that this question has been asked multiple times, but I could not find exactly what I am looking for in the previous topics. Please feel free to close this topic if it is a duplicate.
I have a dataframe as follows:
> data %>% arrange(customer_id)
region market unit_key
1 2 98 320
2 2 98 321
3 4 184 287
4 4 4 7
5 4 4 287
6 66 521 899
7 66 521 900
8 66 3012 899
9 66 521 916
10 66 3011 900
I would like to make a 4th column, a unique identifier called combination id, that is formed as follows:
So basically, for each unique pair of region and market I should get a unique identifier that will allow me to retrieve the unit_keys that are linked with the combination of markets for a specific region.
I tried to do it with a cross-join and with tidyr::crossing() but I didn't get the expected results.
Any hints on this topic?
BR
/Edgar
Unfortunately, the proposed solution:
df %>% group_by(region, market) %>% mutate(id = cur_group_id())
does not work, as I get the following result:
combination_id %>% arrange(region)
# A tibble: 373 x 4
# Groups: region, market [182]
region market unit_key id
<dbl> <dbl> <dbl> <int>
1 2 98 320 1
2 2 98 321 1
3 4 184 287 3
4 4 4 7 2
5 4 4 287 2
6 66 521 899 4
In this case, for region 4 we should have the following combinations:
id=2 where market is 184
id=3 where market is 4
id=4 where market is 4 and 184

r remove records that don't represent all groups

After manipulating raw data we have obtained following data.frame
ItemID GroupID mentions
1 601 3 1
2 601 4 1
3 611 3 1
4 661 3 1
5 801 3 1
6 821 3 1
6 841 1 3
6 841 2 3
6 841 3 3
6 841 4 3
I have 10000 records like this and my first goal is to figure out which items represent all 4 GroupIDs. First I tried to do this visually by plotting.
ggplot(item.stats, aes(x=ItemID, y=mentions, fill=GroupID)) +
geom_bar(stat='identity', position='dodge')
With the large dataset this didn't look like a sensible approach. What is the best way to get a good idea of how many items represent all groups, along with their mentions?
In above example after filtering it should only have:
ItemID GroupID mentions
6 841 1 3
6 841 2 3
6 841 3 3
6 841 4 3
Trying to get meaningful visualization:
test.with.id <- transform(test,id=as.numeric(factor(ItemID)))
ggplot(test.with.id, aes(x=id, y=mentions, fill=GroupID)) +
geom_histogram(stat='identity', position='stack', binwidth = 2)
May be similar to this
How to plot multiple stacked histograms together in R?
You can group by ItemID, then filter based on whether all 4 group IDs are in the GroupID column:
library(dplyr)
df %>% group_by(ItemID) %>% filter(all(1:4 %in% GroupID))
# A tibble: 4 x 3
# Groups: ItemID [1]
# ItemID GroupID mentions
# <int> <int> <int>
#1 841 1 3
#2 841 2 3
#3 841 3 3
#4 841 4 3
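A base R sketch of the same filter, which also counts how many items cover all groups. It assumes, as in the question, that the four groups are coded 1 to 4; the data frame is typed in from the question, and the names covers_all and complete are made up for illustration.

```r
# Example data from the question
item.stats <- data.frame(
  ItemID   = c(601, 601, 611, 661, 801, 821, 841, 841, 841, 841),
  GroupID  = c(3, 4, 3, 3, 3, 3, 1, 2, 3, 4),
  mentions = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# For every row: does its ItemID appear with all of the groups 1..4?
# ave() returns 1/0 here because the logical result is coerced to numeric
covers_all <- ave(item.stats$GroupID, item.stats$ItemID,
                  FUN = function(g) all(1:4 %in% g)) == 1

# Rows of items that represent all groups
complete <- item.stats[covers_all, ]

# Number of distinct items that represent all groups
n_complete_items <- length(unique(complete$ItemID))
```

A single count like n_complete_items may be more informative than a plot when there are thousands of items.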

in r, how can one trim or winsorize data by a factor

I'm trying to apply the winsor function at each level of a factor (subjects) in order to remove extreme cases. I can apply the winsor function to the entire column, but would like to do it within subject.
Subject RT
1 402
1 422
1 155
1 460
2 283
2 224
2 346
2 447
3 415
3 161
3 1
3 343
Ideally, I'd like the output to be a vector containing the same number of rows as the input but with outliers (e.g. the second last value of Subject 3) to be removed and replaced as per the winsor function.
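You are looking for the ?by function. Note that its third argument must be a function that is applied to each subset (winsor comes from the psych package):
# for example:
library(psych)
by(myDF, myDF$Subject, function(d) winsor(d$RT))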
However, using data.table (instead of data.frame) might be better suited for you
### broken down step by step:
library(data.table)
myDT <- data.table(myDF)
myDT[, winsorResult := winsor(RT), by=Subject]
library(psych)
transform(dat,win = ave(RT,Subject,FUN=winsor))
Subject RT win
1 1 402 402.0
2 1 422 422.0
3 1 155 303.2
4 1 460 437.2
5 2 283 283.0
6 2 224 259.4
7 2 346 346.0
8 2 447 386.4
9 3 415 371.8
10 3 161 161.0
11 3 1 97.0
12 3 343 343.0
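The output above is consistent with clamping each subject's values to that subject's 20% and 80% quantiles. A minimal base R sketch of that idea, without psych, might look as follows. The helper clamp_to_quantiles and its default trim = 0.2 are assumptions made for illustration, and dat is rebuilt from the question's data.

```r
# Data from the question
dat <- data.frame(
  Subject = rep(1:3, each = 4),
  RT = c(402, 422, 155, 460, 283, 224, 346, 447, 415, 161, 1, 343)
)

# Hypothetical helper: pull values that lie outside the [trim, 1 - trim]
# quantiles back to those quantiles
clamp_to_quantiles <- function(x, trim = 0.2) {
  limits <- quantile(x, probs = c(trim, 1 - trim), names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Apply the helper within each Subject; one value per input row is returned
dat$win <- ave(dat$RT, dat$Subject, FUN = clamp_to_quantiles)
```

Because ave() returns a vector as long as its input, the result can be stored directly as a new column, which matches the requirement that the output has the same number of rows as the input.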

Allow a maximum number of entries when certain conditions apply

I have a dataset with a lot of entries. Each of these entries belongs to a certain ID (belongID), the entries are unique (with uniqID), but multiple entries can come from the same source (sourceID). It is also possible that multiple entries from the same source have the same belongID. For the purposes of the research I need to do on the dataset, I have to get rid of the entries of a single sourceID that occur more than 5 times for 1 belongID. The maximum of 5 entries that need to be kept are the ones with the highest 'Time' value.
To illustrate this I have the following example dataset:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
1 1001 108 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1005 119 2
2 1006 120 2
2 1005 121 1
2 1007 122 1
3 1010 123 5
3 1480 124 2
The example in the end should look like this:
belongID sourceID uniqID Time
1 1001 101 5
1 1002 102 5
1 1001 103 4
1 1001 104 3
1 1001 105 3
1 1005 106 2
1 1001 107 2
2 1005 109 5
2 1006 110 5
2 1005 111 5
2 1006 112 5
2 1005 113 5
2 1006 114 4
2 1005 115 4
2 1006 116 3
2 1005 117 3
2 1006 118 3
2 1007 122 1
3 1010 123 5
3 1480 124 2
There are a lot more columns with data entries in the file, but the selection has to be purely based on time. As shown in the example it can also occur that the 5th and 6th entry of a sourceID with the same belongID have the same time. In this case only 1 has to be chosen, because max=5.
The dataset here is nicely ordered on belongID and time for illustrative purposes, but in the real dataset this is not the case. Any idea how to tackle this problem? I have not come across something similar yet..
if dat is your dataframe:
do.call(rbind,
by(dat, INDICES=list(dat$belongID, dat$sourceID),
FUN=function(x) head(x[order(x$Time, decreasing=TRUE), ], 5)))
Say your data is in df. The ordered (by uniqID) output is obtained after this:
tab <- tapply(df$Time, list(df$belongID, df$sourceID), length)
bIDs <- rownames(tab)
sIDs <- colnames(tab)
for(i in bIDs)
{
  if(all(is.na(tab[bIDs == i, ]))) next
  ids <- na.omit(sIDs[tab[i, sIDs] > 5])
  for(j in ids)
  {
    cond <- df$belongID == i & df$sourceID == j
    old <- df[cond,]
    id5 <- order(old$Time, decreasing = TRUE)[1:5]
    new <- old[id5,]
    df <- df[!cond,]
    df <- rbind(df, new)
  }
}
df[order(df$uniqID), ]
A solution in two lines using the plyr package:
library(plyr)
x <- ddply(dat, .(belongID, sourceID), function(x)tail(x[order(x$Time), ], 5))
xx <- x[order(x$belongID, x$uniqID), ]
The results:
belongID sourceID uniqID Time
5 1 1001 101 5
6 1 1002 102 5
4 1 1001 103 4
2 1 1001 104 3
3 1 1001 105 3
7 1 1005 106 2
1 1 1001 108 2
10 2 1005 109 5
16 2 1006 110 5
11 2 1005 111 5
17 2 1006 112 5
12 2 1005 113 5
15 2 1006 114 4
9 2 1005 115 4
13 2 1006 116 3
8 2 1005 117 3
14 2 1006 118 3
18 2 1007 122 1
19 3 1010 123 5
20 3 1480 124 2
The dataset on which this method is going to be used has 170.000+ entries and almost 30 columns
Benchmarking each of the three provided solutions by danas.zuokas, mplourde and Andrie with the use of my dataset, provided the following outcomes:
danas.zuokas' solution:
User System Elapsed
2829.569 0 2827.86
mplourde's solution:
User System Elapsed
765.628 0.000 763.908
Andrie's solution:
User System Elapsed
984.989 0.000 984.010
Therefore I will use mplourde's solution. Thank you all!
This should be faster, using data.table:
library(data.table)
DT = as.data.table(dat)
DT[, .SD[tail(order(Time),5)], by=list(belongID, sourceID)]
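The selection rule itself (sort by Time, keep at most five rows per belongID/sourceID pair) can also be sketched in base R without building one data frame per group. This is an untimed illustration on the example data from the question, not a benchmark claim; the names sorted, rank_in_group and kept are made up.

```r
# Example data from the question
dat <- data.frame(
  belongID = c(rep(1, 8), rep(2, 14), rep(3, 2)),
  sourceID = c(1001, 1002, 1001, 1001, 1001, 1005, 1001, 1001,
               1005, 1006, 1005, 1006, 1005, 1006, 1005, 1006,
               1005, 1006, 1005, 1006, 1005, 1007,
               1010, 1480),
  uniqID   = 101:124,
  Time     = c(5, 5, 4, 3, 3, 2, 2, 2,
               5, 5, 5, 5, 5, 4, 4, 3, 3, 3, 2, 2, 1, 1,
               5, 2)
)

# Sort so that, within each belongID/sourceID pair, the highest Time comes first
sorted <- dat[order(dat$belongID, dat$sourceID, -dat$Time), ]

# Position of each row within its belongID/sourceID pair (1 = highest Time)
rank_in_group <- ave(seq_len(nrow(sorted)), sorted$belongID, sorted$sourceID,
                     FUN = seq_along)

# Keep at most five rows per pair and restore the uniqID order
kept <- sorted[rank_in_group <= 5, ]
kept <- kept[order(kept$uniqID), ]
```

When the 5th and 6th entries tie on Time, this sketch keeps whichever comes first in the input, since order() leaves unresolved ties in their original order.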
Aside : suggest to count the number of times the same variable name is repeated in the various answers to this question. Do you ever have a lot of long or similar object names?
