R Help: Count Unique Values by Group [duplicate] - r

Here is a sample dataset to illustrate my problem:
example=data.frame(Group1=c(1,1,1,2,2,10,15,23),
Group2=c(100,100,150,200,234,456,465,710),
UniqueID=c('ABC67DF','ADC45BN','ADC45BN','ADC44BB','BBG40ML','CXD99QA','BBG40ML','VDF72PX'))
This is what the dataset looks like:
Group1 Group2 UniqueID
1 100 ABC67DF
1 100 ADC45BN
1 150 ADC45BN
2 200 ADC44BB
2 234 BBG40ML
10 456 CXD99QA
15 465 BBG40ML
23 710 VDF72PX
I want to count the number of occurrences of each UniqueID and have a dataset that looks like this:
Group1 Group2 UniqueID Count
1 100 ABC67DF 1
1 100 ADC45BN 1
1 150 ADC45BN 2
2 200 ADC44BB 1
2 234 BBG40ML 1
10 456 CXD99QA 1
15 465 BBG40ML 2
23 710 VDF72PX 1
I have tried the following code:
library(plyr)
Count=count(data$UniqueID)
But this just squishes my dataset down to show only unique UniqueIDs. Can anyone help me acquire the desired dataset?

A base R solution:
example$ones <- 1 # create a vector of 1's
example <- transform(example, Count = ave(ones, UniqueID, FUN=cumsum)) # get counts
example$ones <- NULL # delete vector of 1's previously created
example # check results
Group1 Group2 UniqueID Count
1 1 100 ABC67DF 1
2 1 100 ADC45BN 1
3 1 150 ADC45BN 2
4 2 200 ADC44BB 1
5 2 234 BBG40ML 1
6 10 456 CXD99QA 1
7 15 465 BBG40ML 2
8 23 710 VDF72PX 1
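The helper column of ones is not strictly needed. As a minimal sketch, assuming the same example data frame from the question, ave() can number the rows within each UniqueID directly by applying seq_along per group:

```r
# Same sample data as in the question
example <- data.frame(
  Group1   = c(1, 1, 1, 2, 2, 10, 15, 23),
  Group2   = c(100, 100, 150, 200, 234, 456, 465, 710),
  UniqueID = c("ABC67DF", "ADC45BN", "ADC45BN", "ADC44BB",
               "BBG40ML", "CXD99QA", "BBG40ML", "VDF72PX")
)

# Running count of each UniqueID, in row order, without a temporary column
example$Count <- ave(seq_along(example$UniqueID), example$UniqueID,
                     FUN = seq_along)
print(example)
```

The first argument only supplies a vector of the right length; the per-group seq_along does the counting.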

Related

Create a new dataframe in R resulting from comparison of differently ordered columns from two other databases with different lengths

I have two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and in the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share the same "Species_name" and "Biom" values (a "Biom" is an area of the world with the same climatic conditions). "Biom" values go from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben). Furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply a population of the species belonging to that Biom).
CDD26_FF:
CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857
CDD_HI:
CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434
I want to pick out the rows that have the same combination of "Species_name" and "Biom": in the example these are lines 3, 4, 5 of CDD26_FF, which match lines 2, 4, 5 of CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the "compare()" function seems to do) but all the other columns as well.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" for the matched lines.
How can I do that?
Aside from "compare()", I tried a "for" loop, but the lengths of the tables differ, so I tried a 3-level nested for loop, still without results. I also tried "compareDF()" and "semi_join()". No results so far. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns only the rows that are present in both tables/data.frames under the matching condition (in this case: matching "Biom" and "Species_name").
After that, it is easy to calculate the ratio using mutate:
library(dplyr)
cdd26_f %>%
inner_join(cdd_hi, by=c("Biom", "Species_name")) %>%
mutate(ratio = AreaCellSuAreaTotal/AreaCellSuAreaTot_HI) %>%
select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: Remove the select part if you need all columns, or adjust it to keep other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
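If dplyr is not available, the same inner join can be sketched in base R with merge(), which keeps only rows whose key columns match in both tables by default. The data frames below are rebuilt by hand from the sample rows in the question and, as a simplifying assumption, carry only the key columns and the two area-fraction columns; the object names are illustrative.

```r
# Sample rows from the question, trimmed to the columns used here
cdd26_ff <- data.frame(
  Biom = c(10, 1, 10, 1, 10, 2, 12),
  Species_name = c("Abrocomo_ben", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTotal = c(0.076923, 0.125, 0.033333, 0.2,
                          0.022727, 0.5, 0.142857)
)
cdd_hi <- data.frame(
  Biom = c(1, 10, 2, 1, 10, 1, 4),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTot_HI = c(0.125, 0.166666, 0.2, 0.1,
                           0.022727, 0.333333, 0.130434)
)

# Inner join on both keys, then compute the far-future / historical ratio
matched <- merge(cdd26_ff, cdd_hi, by = c("Biom", "Species_name"))
matched$ratio <- matched$AreaCellSuAreaTotal / matched$AreaCellSuAreaTot_HI
print(matched)
```

With the full tables, columns that share a name but are not join keys (such as AreaCell) would come back with .x and .y suffixes, just as in the dplyr version.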

Capturing all items in basket (R/arulesSequences)

I am having an issue with the arulesSequences package in R. I was able to read baskets into the program, and create a data.frame, however it fails to recognize any other items beyond the first column. Below is a sample of my data set, which follows the form demonstrated here: Data Mining Algorithms in R/Sequence Mining/SPADE.
[sequenceID] [eventID] [SIZE] items
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
The dataset was exported from SAS as a [.CSV] then converted to a tab-delimited [.TXT] file in Excel. Headers were removed for import into R, but I placed them in brackets above for clarity in this example. All spaces were replaced with an underscore ("_"), and item names were simplified as much as possible. Each item is listed in a separate column. The following command was used to import the file:
baskets <- read_baskets(con = "...filepath/spade.txt", sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
I am presented with no errors, so I continue with the following command:
as(baskets, "data.frame")
Here, it returns the data.frame as requested, however it fails to capture the items beyond the first column:
items sequenceID eventID SIZE
{OB/Gyn} 2 1 1
{Internal_Medicine} 15 1 1
{Internal_Medicine} 15 2 1
{Internal_Medicine} 15 3 1
{Internal_Medicine} 56 1 2
{Oncology} 84 1 1
{Hematology} 151 1 2
{Hematology/Oncology} 151 2 1
{Hematology/Oncology} 151 3 1
{Gastroenterology} 185 1 2
Line 5 should look like:
{Internal_Medicine, Neurology} 56 1 2
I have tried importing the file directly as a [.CSV], but the data.frame results in a similar format to my above attempt using tabs, except it places a comma in front of the first item:
{,Internal_Medicine} 56 1 2
Any troubleshooting suggestions would be greatly appreciated. It seems like this package is picky when it comes to formatting.
Line 5 should look like:
{Internal_Medicine, Neurology} 56 1 2
With the package versions shown below, the same data is read as expected (see the marked row):
library(arulesSequences)
packageVersion("arulesSequences")
# [1] ‘0.2.16’
packageVersion("arules")
# [1] ‘1.5.0’
txt <- readLines(n=10)
2 1 1 OB/Gyn
15 1 1 Internal_Medicine
15 2 1 Internal_Medicine
15 3 1 Internal_Medicine
56 1 2 Internal_Medicine Neurology
84 1 1 Oncology
151 1 2 Hematology Hematology
151 2 1 Hematology/Oncology
151 3 1 Hematology/Oncology
185 1 2 Gastroenterology Gastroenterology
writeLines(txt, tf<-tempfile())
baskets <- read_baskets(con = tf, sep = "[ \t]+",info=c("sequenceID", "eventID", "SIZE"))
as(baskets, "data.frame")
# items sequenceID eventID SIZE
# 1 {OB/Gyn} 2 1 1
# 2 {Internal_Medicine} 15 1 1
# 3 {Internal_Medicine} 15 2 1
# 4 {Internal_Medicine} 15 3 1
# 5 {Internal_Medicine,Neurology} 56 1 2 # <----------
# 6 {Oncology} 84 1 1
# 7 {Hematology} 151 1 2
# 8 {Hematology/Oncology} 151 2 1
# 9 {Hematology/Oncology} 151 3 1
# 10 {Gastroenterology} 185 1 2
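Since the data itself reads correctly here, one thing worth checking is the file on disk. The following is only a hypothetical diagnostic, not part of arulesSequences: it splits each line on the same separator pattern and compares the number of item fields with the declared SIZE column. The three lines used are copied from the sample above; for a real file, readLines() on its path would supply txt.

```r
# A few lines in the same layout as the sample (sequenceID eventID SIZE items...)
txt <- c("2 1 1 OB/Gyn",
         "56 1 2 Internal_Medicine Neurology",
         "151 1 2 Hematology Hematology")

# Split on the separator pattern that is passed to read_baskets()
fields <- strsplit(txt, "[ \t]+")

# Items are whatever follows the three info columns
n_items  <- lengths(fields) - 3
declared <- as.integer(vapply(fields, `[`, character(1), 3))

# Lines whose item count disagrees with SIZE point at a formatting problem
mismatch <- which(n_items != declared)
print(mismatch)
```

An empty result means every line has as many items as SIZE says; any index printed is a line to inspect for stray delimiters or merged cells.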

Count of row frequency in a specific range

I have a data frame df_final with many rows (3000) and 4 columns.
To count the number of times each number occurs in a specific column, I'm using this:
counts <- ddply(df_final, .(round(df_final$`Nº HB (1-8)`)), nrow)
names(counts) <- c("HB", "% ")
Output looks like:
1 4 1
2 5 34
3 6 470
4 7 1886
5 8 609
However, what I really need is the frequency of numbers between a range, for example (0-8).
Output should look like:
1 1 0
2 2 0
3 3 0
4 4 0
5 5 34
6 6 470
7 7 1886
8 8 609
We can use table after specifying the factor levels:
table(factor(round(df_final$`Nº HB (1-8)`), levels = 1:8))
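To see why fixing the levels fills in the zero counts, here is a small sketch on a made-up vector (hb is hypothetical and stands in for the real column): table() reports every level of a factor, including levels that never occur.

```r
# Hypothetical values standing in for the rounded column
hb <- c(5.2, 6.1, 5.9, 7.4, 7.0, 6.8, 8.3)

# Declaring levels 1:8 forces a count for each of them, zero included
counts <- table(factor(round(hb), levels = 1:8))

# Convert to a two-column data frame, similar to the ddply() result
counts_df <- as.data.frame(counts, responseName = "n")
names(counts_df)[1] <- "HB"
print(counts_df)
```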

Select specific rows based on previous row value (in the same column)

I've been trying to figure a way to script this through R, but just can't get it. I have a dataset like this:
Trial Type Correct Latency
1 55 0 0
3 30 1 766
4 10 1 344
6 40 1 716
7 10 1 326
9 30 1 550
10 10 1 350
11 64 0 0
13 30 1 683
14 10 1 270
16 30 1 666
17 10 1 297
19 40 1 616
20 10 1 315
21 64 0 0
23 40 1 850
24 10 1 322
26 30 1 566
27 20 0 766
28 40 1 500
29 20 1 230
which goes on for much longer (around 1000 rows).
From this one dataset, I would like to create 4 separate data.frames/tables that I can export and also use for my own calculations, one for each of these cases:
type 10 rows which are preceded by a type 30 row
type 10 rows which are preceded by a type 40 row
type 20 rows which are preceded by a type 30 row
type 20 rows which are preceded by a type 40 row
I would like for all the columns in the relevant rows to be placed into these new tables, but only including the column info of row types 10 or 20.
For example, the first table (type 10 preceded by type 30) would look like this based on the sample data:
Trial Type Correct Latency
4 10 1 344
10 10 1 350
14 10 1 270
17 10 1 297
Second table (type 10 preceded by type 40):
Trial Type Correct Latency
7 10 1 326
20 10 1 315
24 10 1 322
Third table (type 20 preceded by type 30):
Trial Type Correct Latency
27 20 0 766
Fourth table (type 20 preceded by type 40):
Trial Type Correct Latency
29 20 1 230
I can subset just fine to get one table only of type 10 rows and another for type 20 rows, but I can't figure out how to create different tables for type 10 and 20 rows based on the previous type value. Also, an issue is that "Trials" is not in order (skips numbers).
Any help would be greatly appreciated. Thank you.
Also, is there a way to include the previous row as well, so the output for the fourth table would look something like this:
Fourth table (type 20 preceded by type 40):
Trial Type Correct Latency
28 40 1 500
29 20 1 230
For the fourth example, you could use which() in combination with lag() from dplyr to obtain the indices that meet your criteria. You can then use these to subset the data.frame.
# Get indices of rows that meet the condition
ind2 <- which(df$Type==20 & dplyr::lag(df$Type)==40)
# Get indices of the rows just before the ones that meet the condition
ind1 <- ind2 - 1
# Subset rows; the trailing comma selects rows for a data.frame as well as a data.table
df[sort(c(ind1, ind2)), ]
Trial Type Correct Latency
1: 28 40 1 500
2: 29 20 1 230
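The same idea extends to all four tables. The sketch below is one possible way to do it in base R, assuming the sample rows from the question; prev_type and pick() are illustrative names. It shifts Type down by one row to get the previous row's type, so it does not depend on Trial being consecutive.

```r
# Sample rows from the question
df <- data.frame(
  Trial = c(1, 3, 4, 6, 7, 9, 10, 11, 13, 14, 16, 17, 19, 20, 21,
            23, 24, 26, 27, 28, 29),
  Type = c(55, 30, 10, 40, 10, 30, 10, 64, 30, 10, 30, 10, 40, 10, 64,
           40, 10, 30, 20, 40, 20),
  Correct = c(0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,
              1, 1, 1, 0, 1, 1),
  Latency = c(0, 766, 344, 716, 326, 550, 350, 0, 683, 270, 666, 297,
              616, 315, 0, 850, 322, 566, 766, 500, 230)
)

# Type of the preceding row (NA for the first row)
prev_type <- c(NA, head(df$Type, -1))

# Rows of type `cur` whose preceding row has type `prev`
pick <- function(cur, prev) df[which(df$Type == cur & prev_type == prev), ]

t10_after_30 <- pick(10, 30)
t10_after_40 <- pick(10, 40)
t20_after_30 <- pick(20, 30)
t20_after_40 <- pick(20, 40)
print(t10_after_30)
```

Each result is an ordinary data frame, so it can be written out with write.csv() or used for further calculations.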
Here is some example code if you always want to delete the first trials of your data.
library(dplyr) # provides filter() and lag() as used below
var1 <- c(1,2,1,2,1,2,1,2,1,2)
var2 <- c(1,1,1,2,2,2,2,3,3,3)
dat <- data.frame(var1, var2)
var1 var2
1 1 1
2 2 1
3 1 1
4 2 2
5 1 2
6 2 2
7 1 2
8 2 3
9 1 3
10 2 3
# delete only the row where var2 changes (the first row is dropped too, because lag() gives NA there)
filter(dat,lag(var2)==var2)
var1 var2
1 2 1
2 1 1
3 1 2
4 2 2
5 1 2
6 1 3
7 2 3
#delete the first 2 trials
#make a list of all rows where var2[n-1]!=var2[n] --> using lag from dplyr
drops <- c(1,2,which(lag(dat$var2)!=dat$var2), which(lag(dat$var2)!=dat$var2)+1)
if (!identical(drops,numeric(0))) { dat <- dat[-drops,] }
var1 var2
3 1 1
6 2 2
7 1 2
10 2 3

Random Sample with Replacement Loop

I have an R script that allows me to select a sample size and take fifty individual random samples with replacement. Below is an example of this code:
## Creates data frame
df = as.data.table(data)
## Select sample size
sample.size = 5
## Creates Sample 1 (Size 5)
Sample.1<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.1$Sample <- c("01")
According to the R script above, I first created a data frame. I then select my sample size, which in this case is 5. This represents just one sample. Due to my lack of experience with R, I repeat this code 49 more times. The last piece of code looks like this:
## Creates Sample 50 (Size 5)
Sample.50<-df[,
Dollars[sample(.N, size=sample.size, replace=TRUE)], by = Num]
Sample.50$Sample <- c("50")
The sample output would look something like this (Sample Range 1 - 50):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 8800 50
1 3800 50
1 10400 50
1 2200 50
1 29000 50
It should be noted that the variable 'Num' was created for grouping purposes and has little to no influence on my overall question (which is posted below).
Instead of repeating this code fifty times, to get me fifty individual samples (with a size of 5), is there a loop I can create to help me limit my code? I have been recently asked to create ten thousand random samples, each of a size of 5. I obviously cannot repeat this code ten thousand times so I need some sort of loop.
A sample of my final output should look something like this (Sample Range 1 - 10,000):
Num Dollars Sample
1 85000 01
1 4900 01
1 18000 01
1 6900 01
1 11000 01
1 9900 10000
1 8300 10000
1 10700 10000
1 6800 10000
1 31000 10000
Thank you all in advance for your help, it's greatly appreciated.
Here is some sample data if needed:
Num Dollars
1 31002
1 13728
1 23526
1 80068
1 86244
1 9330
1 27169
1 13694
1 4781
1 9742
1 20060
1 35230
1 15546
1 7618
1 21604
1 8738
1 5299
1 12081
1 7652
1 16779
A very simple method would be to use a for loop and store the results in a list:
lst <- list()
for(i in seq_len(3)){
lst[[i]] <- df[sample(seq_len(nrow(df)), 5, replace = TRUE),]
lst[[i]]["Sample"] <- i
}
> lst
[[1]]
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
[[2]]
Num Dollars Sample
9 1 4781 2
13 1 15546 2
12 1 35230 2
17 1 5299 2
12.1 1 35230 2
[[3]]
Num Dollars Sample
1 1 31002 3
7 1 27169 3
17 1 5299 3
5 1 86244 3
6 1 9330 3
Then, to create a single data.frame, use do.call to rbind the list elements together:
do.call(rbind, lst)
Num Dollars Sample
20 1 16779 1
1 1 31002 1
12 1 35230 1
14 1 7618 1
14.1 1 7618 1
9 1 4781 2
13 1 15546 2
121 1 35230 2
17 1 5299 2
12.1 1 35230 2
11 1 31002 3
7 1 27169 3
171 1 5299 3
5 1 86244 3
6 1 9330 3
It's worth noting that if you're sampling with replacement, then drawing 50 (or 10,000) samples of size 5 is equivalent to drawing one sample of size 250 (or 50,000). Thus I would do it like this (you'll see I stole a line from @beginneR's answer):
library(data.table)
df = as.data.table(data)
## Select sample size
sample.size = 5
n.samples = 10000
# Sample and assign groups
draws <- df[sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE), ]
draws[, Sample := rep(1:n.samples, each = sample.size)]
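For readers without data.table, the single-draw idea can be sketched in base R as follows. This is a minimal illustration on the 20 sample rows from the question; the seed is arbitrary, and the sketch ignores grouping by Num because the sample data contains only one group.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable

# The 20 sample rows from the question
df <- data.frame(
  Num = 1,
  Dollars = c(31002, 13728, 23526, 80068, 86244, 9330, 27169, 13694,
              4781, 9742, 20060, 35230, 15546, 7618, 21604, 8738,
              5299, 12081, 7652, 16779)
)

sample.size <- 5
n.samples   <- 10000

# One big draw with replacement, then label consecutive blocks of 5
idx   <- sample(seq_len(nrow(df)), sample.size * n.samples, replace = TRUE)
draws <- df[idx, ]
draws$Sample <- rep(seq_len(n.samples), each = sample.size)
rownames(draws) <- NULL

head(draws, 10)
```

If there were several Num groups and each needed its own samples, the draw would have to be repeated within each group, as the original by = Num code does.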
