Remove records that don't represent all groups in R

After manipulating the raw data, we obtained the following data.frame:
ItemID GroupID mentions
1 601 3 1
2 601 4 1
3 611 3 1
4 661 3 1
5 801 3 1
6 821 3 1
6 841 1 3
6 841 2 3
6 841 3 3
6 841 4 3
I have 10000 records like this, and my first goal is to figure out which items represent all 4 GroupIDs. First I tried to do this visually by plotting:
ggplot(item.stats, aes(x=ItemID, y=mentions, fill=GroupID)) +
geom_bar(stat='identity', position='dodge')
With the large dataset this didn't look sensible. What's the best way to get a good idea of how many items represent all groups, along with their mentions?
In the above example, after filtering, only these rows should remain:
ItemID GroupID mentions
6 841 1 3
6 841 2 3
6 841 3 3
6 841 4 3
Trying to get a meaningful visualization:
test.with.id <- transform(test,id=as.numeric(factor(ItemID)))
ggplot(test.with.id, aes(x=id, y=mentions, fill=GroupID)) +
geom_histogram(stat='identity', position='stack', binwidth = 2)
Maybe similar to this:
How to plot multiple stacked histograms together in R?

You can group by ItemID, then keep a group only if all 4 group IDs appear in its GroupID column:
library(dplyr)
# keep only the items whose GroupID column contains all of 1, 2, 3 and 4
df %>% group_by(ItemID) %>% filter(all(1:4 %in% GroupID))
# A tibble: 4 x 3
# Groups: ItemID [1]
# ItemID GroupID mentions
# <int> <int> <int>
#1 841 1 3
#2 841 2 3
#3 841 3 3
#4 841 4 3
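To also answer the "how many items" part of the question, here is a minimal base R sketch that needs no extra packages. It rebuilds the sample data from the question; the helper names `covers.all`, `complete.ids` and `complete.items` are illustrative and do not come from the original post.

```r
# Sample data copied from the question
item.stats <- data.frame(
  ItemID   = c(601, 601, 611, 661, 801, 821, 841, 841, 841, 841),
  GroupID  = c(3, 4, 3, 3, 3, 3, 1, 2, 3, 4),
  mentions = c(1, 1, 1, 1, 1, 1, 3, 3, 3, 3)
)

# One TRUE/FALSE per ItemID: does the item appear in all of groups 1..4?
covers.all <- tapply(item.stats$GroupID, item.stats$ItemID,
                     function(g) all(1:4 %in% g))

# IDs of the items that cover every group, and how many there are
complete.ids <- names(covers.all)[covers.all]
n.complete   <- sum(covers.all)

# Rows belonging to those items only
complete.items <- item.stats[item.stats$ItemID %in% complete.ids, ]
print(n.complete)
print(complete.items)
```

On the sample data only item 841 qualifies, so `complete.items` holds its four rows.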

Related

How to get the IDs of a cluster of nodes in a network using igraph in R?

I have an edge list that is in the following format:
# A tibble: 162,157 x 4
id source target weight
<int> <int> <int> <int>
1 1 2 166 3777
2 2 2 204 17527
3 3 2 279 999
4 4 2 373 6826
5 5 2 552 1313
6 6 2 664 680
7 7 2 670 7624
8 8 2 791 167
9 9 2 1015 99
10 10 2 1182 18716
# … with 162,147 more rows
I have created a graph from this data using igraph::graph_from_data_frame(df, directed=TRUE) and have plotted the results, which can be seen in the following image.
The plot was generated with the following code snippet.
ggraph(g) +
geom_node_point(size=0.6) +
theme_graph()
What I would like to do is figure out which nodes are in the 6 tiny clusters surrounded by whitespace. I realize that I could assign labels, but in this instance that would be impossible to read. Is there a more mathematical or programmatic approach to identifying what those nodes are using igraph?

R Help: Count Unique Values by Group [duplicate]

Here is a sample dataset to illustrate my problem:
example=data.frame(Group1=c(1,1,1,2,2,10,15,23),
Group2=c(100,100,150,200,234,456,465,710),
UniqueID=c('ABC67DF','ADC45BN','ADC45BN','ADC44BB','BBG40ML','CXD99QA','BBG40ML','VDF72PX'))
This is what the dataset looks like:
Group1 Group2 UniqueID
1 100 ABC67DF
1 100 ADC45BN
1 150 ADC45BN
2 200 ADC44BB
2 234 BBG40ML
10 456 CXD99QA
15 465 BBG40ML
23 710 VDF72PX
I want to count the number of occurrences of each UniqueID and have a dataset that looks like this:
Group1 Group2 UniqueID Count
1 100 ABC67DF 1
1 100 ADC45BN 1
1 150 ADC45BN 2
2 200 ADC44BB 1
2 234 BBG40ML 1
10 456 CXD99QA 1
15 465 BBG40ML 2
23 710 VDF72PX 1
I have tried the following code:
library(plyr)
Count=count(example$UniqueID)
But this just squishes my dataset down to show only unique UniqueIDs. Can anyone help me acquire the desired dataset?
A base R solution:
example$ones <- 1 # create a vector of 1's
example <- transform(example, Count = ave(ones, UniqueID, FUN=cumsum)) # get counts
example$ones <- NULL # delete vector of 1's previously created
example # check results
Group1 Group2 UniqueID Count
1 1 100 ABC67DF 1
2 1 100 ADC45BN 1
3 1 150 ADC45BN 2
4 2 200 ADC44BB 1
5 2 234 BBG40ML 1
6 10 456 CXD99QA 1
7 15 465 BBG40ML 2
8 23 710 VDF72PX 1
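The same running count can be sketched without the temporary column of ones by letting `ave()` apply `seq_along` within each UniqueID. This is an alternative phrasing of the idea above, not the answerer's code; it rebuilds the sample data so that it runs on its own.

```r
# Sample data from the question
example <- data.frame(
  Group1   = c(1, 1, 1, 2, 2, 10, 15, 23),
  Group2   = c(100, 100, 150, 200, 234, 456, 465, 710),
  UniqueID = c("ABC67DF", "ADC45BN", "ADC45BN", "ADC44BB",
               "BBG40ML", "CXD99QA", "BBG40ML", "VDF72PX"),
  stringsAsFactors = FALSE
)

# Within each UniqueID, number the rows 1, 2, 3, ... in order of appearance
example$Count <- ave(seq_along(example$UniqueID), example$UniqueID,
                     FUN = seq_along)
print(example)
```

Because `ave()` returns its result in the original row order, no sorting or merging is needed afterwards.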

R - table() returns repeated factors

I'm using FiveThirtyEight's Star Wars survey.
In $Anakin I've coded the respondent's view of Anakin as a categorical variable from 0 (very unfavourably) to 5 (very favourably). "N/A" on the survey was recoded as "". (I did that step in MS Excel.)
$Startrek contains whether the respondent's seen Star Trek or not.
starwars <- read.csv2("starsurvey.csv", header = TRUE, stringsAsFactors = FALSE)
as.factor(starwars$Anakin)
as.factor(starwars$Startrek)
tbl <- table(starwars$Anakin, starwars$Startrek)
The table() function returns this:
        No Yes
1    0  20  19
2    2  31  50
3    0  68  67
4    1 140 128
5    5 101 139
I'm wondering why the function returns 0, 2, 0, 1, 5 for the factors in $Anakin, since it contains:
starwars$Anakin
[1] 5 <NA> 4 5 2 5 4 3 4 5 <NA> <NA> 4 4
[15] 4 2 3 5 5 5 4 3 3 2 5 <NA> 4 4
[29] 1 1 3 5 2 <NA> <NA> 5 5 4 4 4 3 4
[43] 4 4 4 4 <NA> 2 3 <NA> 4 4 5 4 4 <NA>
The output of table here is confusing because your factor levels (1 to 5) look like row numbers, and there are some blank ("") responses to the Startrek variable which makes it appear like the data is only under the No and Yes columns.
So, the data here is a 5 by 3 table, with the rows representing the score from Anakin (1 to 5) and the columns representing 3 types of response to Startrek ("", No, Yes).
Note that rows where Anakin is NA are ignored in the table. To count these too, use addNA:
table(addNA(starwars$Anakin), starwars$Startrek)
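A small self-contained sketch of both points follows. The vectors `anakin` and `startrek` are invented for illustration and are not taken from the survey file.

```r
# Made-up responses: scores 1..5 with some NAs, and a blank ("") Star Trek answer
anakin   <- factor(c(5, NA, 4, 5, 2, 1, NA, 3))
startrek <- c("Yes", "No", "", "No", "Yes", "No", "Yes", "")

# Cross-tabulation: rows are the scores, columns are "", "No" and "Yes"
tbl <- table(anakin, startrek)
print(dim(tbl))        # number of rows and columns
print(colnames(tbl))   # the first column name is the empty string

# addNA() turns NA into an explicit level, so those respondents get a row too
tbl.na <- table(addNA(anakin), startrek)
print(tbl.na)
```

Printing `colnames(tbl)` makes the unlabelled first column visible, which is the part that is easy to misread as row numbers in the default printout.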

R: Sum column from table 2 based on value in table 1, and store result in table 1

I am an R noob, and hope some of you can help me.
I have two data sets:
- store (containing store data, including location coordinates (x,y). The location are integer values, corresponding to GridIds)
- grid (containing all gridIDs (x,y) as well as a population variable TOT_P for each grid point)
What I want to achieve is this:
For each store I want to loop over the grid data and sum the population of the grid IDs close to the store's grid ID.
I.e., basically a SUMIF on the grid population variable, with the condition that
grid(x) < store(x) + 1 &
grid(x) > store(x) - 1 &
grid(y) < store(y) + 1 &
grid(y) > store(y) - 1
How can I accomplish that? My own take has been trying to use different things like merge, sapply, etc, but my R inexperience stops me from getting it right.
Thanks in advance!
Edit:
Sample data:
StoreName StoreX StoreY
Store1 3 6
Store2 5 2
TOT_P GridX GridY
8 1 1
7 2 1
3 3 1
3 4 1
22 5 1
20 6 1
9 7 1
28 1 2
8 2 2
3 3 2
12 4 2
12 5 2
15 6 2
7 7 2
3 1 3
3 2 3
3 3 3
4 4 3
13 5 3
18 6 3
3 7 3
61 1 4
25 2 4
5 3 4
20 4 4
23 5 4
72 6 4
14 7 4
178 1 5
407 2 5
26 3 5
167 4 5
58 5 5
113 6 5
73 7 5
76 1 6
3 2 6
3 3 6
3 4 6
4 5 6
13 6 6
18 7 6
3 1 7
61 2 7
25 3 7
26 4 7
167 5 7
58 6 7
113 7 7
The output I am looking for is
StoreName StoreX StoreY SUM_P
Store1 3 6 479
Store2 5 2 119
I.e., for Store1 it is the sum of TOT_P for grid fields X=[2-4] and Y=[5-7].
One approach would be to use dplyr to calculate the difference between each store and all grid points and then group and sum based on these new columns.
#import library
library(dplyr)
#create example store table
StoreName<-paste0("Store",1:2)
StoreX<-c(3,5)
StoreY<-c(6,2)
df.store<-data.frame(StoreName,StoreX,StoreY)
#df.pop is assumed to already hold the population table from the question
#(columns TOT_P, GridX, GridY), copied from the OP's example
#add dummy column to each table to enable cross join
df.store$k=1
df.pop$k=1
#dplyr to join, calculate absolute distance, filter and sum
df.store %>%
inner_join(df.pop, by='k') %>%
mutate(x.diff = abs(StoreX-GridX), y.diff=abs(StoreY-GridY)) %>%
filter(x.diff<=1, y.diff<=1) %>%
group_by(StoreName) %>%
summarise(StoreX=max(StoreX), StoreY=max(StoreY), tot.pop = sum(TOT_P) )
#output:
StoreName StoreX StoreY tot.pop
<fctr> <dbl> <dbl> <int>
1 Store1 3 6 721
2 Store2 5 2 119
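For a version that runs on its own, the same cross-join-and-filter idea can be sketched in base R. This is an illustrative rewrite, not the answerer's code: the names `store`, `grid`, `pairs`, `near` and `result` are assumptions, and `merge(..., by = NULL)` is used to build the Cartesian product instead of the dummy `k` column.

```r
# Stores from the question
store <- data.frame(StoreName = c("Store1", "Store2"),
                    StoreX = c(3, 5), StoreY = c(6, 2))

# 7 x 7 grid from the question; expand.grid varies GridX fastest,
# which matches the row order of the OP's table
grid <- expand.grid(GridX = 1:7, GridY = 1:7)
grid$TOT_P <- c(  8,   7,  3,   3,  22,  20,   9,
                 28,   8,  3,  12,  12,  15,   7,
                  3,   3,  3,   4,  13,  18,   3,
                 61,  25,  5,  20,  23,  72,  14,
                178, 407, 26, 167,  58, 113,  73,
                 76,   3,  3,   3,   4,  13,  18,
                  3,  61, 25,  26, 167,  58, 113)

# Cartesian product: every store paired with every grid cell
pairs <- merge(store, grid, by = NULL)

# Keep cells at most one step away from the store in both x and y
near <- subset(pairs, abs(StoreX - GridX) <= 1 & abs(StoreY - GridY) <= 1)

# Sum the population per store
result <- aggregate(TOT_P ~ StoreName + StoreX + StoreY, data = near, FUN = sum)
print(result)
```

On this data the sketch gives 721 for Store1 and 119 for Store2, in agreement with the dplyr output above. The cross join creates one row per store and grid cell pair, so for large grids it may be worth filtering by coordinate range before joining.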

Count of row frequency in a specific range

I have a database df_final with many rows (3000) and 4 columns.
To count the number of times each number occurs in a specific column, I'm using this:
counts <- ddply(df_final, .(round(df_final$`Nº HB (1-8)`)), nrow)
names(counts) <- c("HB", "% ")
Output looks like:
1 4 1
2 5 34
3 6 470
4 7 1886
5 8 609
However, what I really need is the frequency of every number within a range, for example (0-8), including the numbers that never occur.
Output should look like:
1 1 0
2 2 0
3 3 0
4 4 0
5 5 34
6 6 470
7 7 1886
8 8 609
We can use table after specifying the levels, so that values that never occur are still reported with a count of 0:
table(factor(round(df_final$`Nº HB (1-8)`), levels = 1:8))
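A minimal sketch of why fixing the levels helps, using an invented vector `hb` in place of the real column (the values are illustrative only):

```r
# Made-up measurements standing in for the real column
hb <- c(5.2, 6.1, 6.4, 7.0, 7.3, 7.8, 8.0)

# Fixing the factor levels to 1..8 forces table() to report every value
# in the range, with 0 for the values that never occur
counts <- table(factor(round(hb), levels = 1:8))

# Optional: turn the table into a two-column data frame
counts.df <- setNames(as.data.frame(counts), c("HB", "n"))
print(counts.df)
```

Without the `levels` argument, `table()` would list only the values that are actually present, which is the behaviour seen in the question.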
