How to count the number of elements by group in R? - r

I have this data frame, and I want to count the frequency (number of occurrences) of each unique value in a column.
userID bookmarkID tagID value
228 1 1 0.0005
255 1 1 0.0007
5 2 1 0.0068
66 2 1 0.0008
99 2 1 0.0006
206 2 1 0.0006
3 3 1 -0.0007
5 3 1 0.0633
7 3 1 -0.0012
For example, for the column bookmarkID, I want to get two vectors: one with the unique values [1,2,3], the other with the corresponding counts [2,4,3]. How can I do this?

I think you're looking for table and unique. Assuming your data.frame is df:
> table(df$bookmarkID)
1 2 3
2 4 3
> unique(df$bookmarkID)
[1] 1 2 3
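If both pieces are wanted as plain vectors, they can be pulled out of the table object itself. This is a minimal sketch that rebuilds only two columns of the question's data; the names counts, ids and n are made up for illustration.

```r
# Two columns from the question's data frame
df <- data.frame(
  userID     = c(228, 255, 5, 66, 99, 206, 3, 5, 7),
  bookmarkID = c(1, 1, 2, 2, 2, 2, 3, 3, 3)
)

counts <- table(df$bookmarkID)

# The table's names are the unique values (stored as strings)
ids <- as.numeric(names(counts))
# as.vector() strips the table attributes and leaves the bare counts
n <- as.vector(counts)

ids  # 1 2 3
n    # 2 4 3
```

Taking both vectors from the same table keeps the values and their counts in matching order.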

Related

How to set a reference row and let other rows to divide its values?

Here is a big data frame with different cells and their mRNA expression data:
A       Gene 1  Gene 2  Gene 3  Gene 4  …
Cell 1  9       12      24      42      30
Cell 2  3       6       12      21      15
Cell 3  6       42      48      84      45
…
Now I want the second row (Cell 2) to be the reference, with every row divided by it, to get this:
A       Gene 1  Gene 2  Gene 3  Gene 4  …
Cell 1  3       2       2       2       2
Cell 2  1       1       1       1       1
Cell 3  2       7       4       4       3
…
I think you can try rep like below
df/df[rep(2,nrow(df)),]
Here is one possible way to solve your problem (assuming that your data is named df):
df[] = Map("/", df, df[2,])
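Both one-liners above assume that every column of df is numeric. If the cell labels are stored in a column (called A here, as in the question's table), one possible variant divides only the numeric columns by their second entry. This is a sketch with the question's first four genes; the column names Gene1 to Gene4 and the helper num are assumptions.

```r
df <- data.frame(
  A     = c("Cell 1", "Cell 2", "Cell 3"),
  Gene1 = c(9, 3, 6),
  Gene2 = c(12, 6, 42),
  Gene3 = c(24, 12, 48),
  Gene4 = c(42, 21, 84)
)

# Logical mask of the numeric columns; the label column is left untouched
num <- sapply(df, is.numeric)

# Divide each numeric column by its value in the reference row (row 2)
df[num] <- lapply(df[num], function(col) col / col[2])

df
```

The reference row becomes all ones, and the other rows hold the ratios relative to Cell 2.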

Create a new dataframe in R resulting from comparison of differently ordered columns from two other databases with different lengths

I have these two data frames, CDD26_FF (5593 rows) and CDD_HI (5508 rows), with the structure (columns) shown below. CDDs are "consecutive dry days", and the two tables show species exposure to CDD in the far future (FF) and the historical period (HI).
I want to focus only on the "Biom" and "Species_name" columns.
As you can see, the two tables share "Species_name" and "Biom" values (biomes are areas of the world with the same climatic conditions). "Biom" values go from 0 to 15. However, a "Species_name" does not always appear in both tables (e.g. Abrocomo_ben). Furthermore, the two tables do not always contain the same combinations of "Species_name" and "Biom" (a combination is simply a population of that species belonging to that Biom).
CDD26_FF :
CDD26_FF  AreaCell  Area_total  Biom  Species_name  AreaCellSuAreaTotal
1         1         13          10    Abrocomo_ben  0.076923
1         1         8           1     Abrocomo_cin  0.125000
1         1         30          10    Abrocomo_cin  0.033333
1         2         10          1     Abrothrix_an  0.200000
1         1         44          10    Abrothrix_an  0.022727
1         3         6           2     Abrothrix_je  0.500000
1         1         7           12    Abrothrix_lo  0.142857
CDD_HI :
CDD_HI  AreaCell  Area_total  Biom  Species_name  AreaCellSuAreaTot_HI
1       1         8           1     Abrocomo_cin  0.125000
1       5         30          10    Abrocomo_cin  0.166666
1       1         5           2     Abrocomo_cin  0.200000
1       1         10          1     Abrothrix_an  0.100000
1       1         44          10    Abrothrix_an  0.022727
1       6         18          1     Abrothrix_je  0.333333
1       1         23          4     Abrothrix_lo  0.130434
I want to highlight the rows that match on both "Species_name" and "Biom": in the example these are lines 3, 4, 5 from CDD26_FF matching lines 2, 4, 5 from CDD_HI, respectively. I want to store these lines in a new table, keeping not only the "Species_name" and "Biom" columns (as the "compare()" function seems to do) but also all the other columns.
More precisely, I then want to calculate the ratio "AreaCellSuAreaTotal" / "AreaCellSuAreaTot_HI" for the highlighted lines.
How can I do that?
Aside from "compare()", I tried a "for" loop, but the lengths of the tables differ, so I tried a 3-nested for loop, still without results. I also tried "compareDF()" and "semi_join()". No results until now. Thank you for your help.
You could use an inner join (provided by dplyr). An inner join returns all rows that are present in both tables/data.frames and satisfy the matching conditions (in this case: matching "Biom" and "Species_name").
Subsequently it's easy to calculate some ratio using mutate:
library(dplyr)
cdd26_f %>%
  inner_join(cdd_hi, by=c("Biom", "Species_name")) %>%
  mutate(ratio = AreaCellSuAreaTotal/AreaCellSuAreaTot_HI) %>%
  select(Biom, Species_name, ratio)
returns
# A tibble: 4 x 3
Biom Species_name ratio
<dbl> <chr> <dbl>
1 1 Abrocomo_cin 1
2 10 Abrocomo_cin 0.200
3 1 Abrothrix_an 2
4 10 Abrothrix_an 1
Note: Remove the select-part, if you need all columns or manipulate it for other columns.
Data
cdd26_f <- readr::read_table2("CDD26_FF AreaCell Area_total Biom Species_name AreaCellSuAreaTotal
1 1 13 10 Abrocomo_ben 0.076923
1 1 8 1 Abrocomo_cin 0.125000
1 1 30 10 Abrocomo_cin 0.033333
1 2 10 1 Abrothrix_an 0.200000
1 1 44 10 Abrothrix_an 0.022727
1 3 6 2 Abrothrix_je 0.500000
1 1 7 12 Abrothrix_lo 0.142857")
cdd_hi <- readr::read_table2("CDD_HI AreaCell Area_total Biom Species_name AreaCellSuAreaTot_HI
1 1 8 1 Abrocomo_cin 0.125000
1 5 30 10 Abrocomo_cin 0.166666
1 1 5 2 Abrocomo_cin 0.200000
1 1 10 1 Abrothrix_an 0.100000
1 1 44 10 Abrothrix_an 0.022727
1 6 18 1 Abrothrix_je 0.333333
1 1 23 4 Abrothrix_lo 0.130434")
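For comparison, the same inner join can be sketched in base R with merge(), which keeps all remaining columns from both tables by default. The data frames ff and hi below are hypothetical trimmed copies of the question's tables (only the key columns and the two ratio inputs), and both is an illustrative name.

```r
ff <- data.frame(
  Biom = c(10, 1, 10, 1, 10, 2, 12),
  Species_name = c("Abrocomo_ben", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTotal = c(0.076923, 0.125, 0.033333, 0.2,
                          0.022727, 0.5, 0.142857)
)
hi <- data.frame(
  Biom = c(1, 10, 2, 1, 10, 1, 4),
  Species_name = c("Abrocomo_cin", "Abrocomo_cin", "Abrocomo_cin",
                   "Abrothrix_an", "Abrothrix_an", "Abrothrix_je",
                   "Abrothrix_lo"),
  AreaCellSuAreaTot_HI = c(0.125, 0.166666, 0.2, 0.1,
                           0.022727, 0.333333, 0.130434)
)

# merge() performs an inner join by default (all = FALSE):
# only Species_name/Biom pairs present in both tables survive
both <- merge(ff, hi, by = c("Species_name", "Biom"))
both$ratio <- both$AreaCellSuAreaTotal / both$AreaCellSuAreaTot_HI

both
```

With the full tables, columns that share a name but are not join keys (such as AreaCell) would get the suffixes .x and .y, so nothing is lost.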

Convert year-month string to three month bins with gaps - how to assign contiguous ascending values?

I have used the code below to "bin" a year.month string into three month bins. The problem is that I want each of the bins to have a number that corresponds where the bin occurs chronologically (i.e. first bin =1, second bin=2, etc.). Right now, the first month bin is assigned to the number 4, and I am not sure why. Any help would be highly appreciated!
> head(Master.feed.parts.gn$yr.mo, n=20)
[1] "2007.10" "2007.10" "2007.10" "2007.11" "2007.11" "2007.11" "2007.11" "2007.12" "2008.01"
[10] "2008.01" "2008.01" "2008.01" "2008.01" "2008.02" "2008.03" "2008.03" "2008.03" "2008.04"
[19] "2008.04" "2008.04"
>
> yearmonth_to_integer <- function(xx) {
+ yy_mm <- as.integer(unlist(strsplit(xx, '.', fixed=T)))
+ return( (yy_mm[1] - 2006) + (yy_mm[2] %/% 3) )
+ }
>
> Cluster.GN <- sapply(Master.feed.parts.gn$yr.mo, yearmonth_to_integer)
> Cluster.GN
2007.10 2007.10 2007.10 2007.11 2007.11 2007.11 2007.11 2007.12 2008.01 2008.01 2008.01
4 4 4 4 4 4 4 5 2 2 2
2008.01 2008.01 2008.02 2008.03 2008.03 2008.03 2008.04 2008.04 2008.04 2008.04 2008.05
2 2 2 3 3 3 3 3 3 3 3
2008.05 2008.05 2008.06 2008.10 2008.11 2008.11 2008.12 <NA> 2009.05 2009.05 2009.05
3 3 4 5 5 5 6 NA 4 4 4
2009.06 2009.07 2009.07 2009.07 2009.09 2009.10 2009.11 2010.01 2010.02 2010.02 2010.02
5 5 5 5 6 6 6 4 4 4 4
UPDATE:
I was asked to provide sample input (yr.mo) and the desired output (Cluster.GN). I have a year-month string with varying numbers of observations for each month, and some months don't have any observations. What I want to do is bin each run of three consecutive months that have data, assigning each three-month "bin" a number as shown below.
yr.mo Cluster.GN
1 2007.10 1
2 2007.10 1
3 2007.10 1
4 2007.10 1
5 2007.10 1
6 2007.11 1
7 2007.11 1
8 2007.11 1
9 2007.11 1
10 2007.12 1
11 2007.12 1
12 2007.12 1
13 2007.12 1
14 2008.10 2
15 2008.10 2
16 2008.10 2
17 2008.10 2
18 2008.12 2
19 2008.12 2
20 2008.12 2
21 2008.12 2
22 2008.12 2
1) Convert the strings to zoo's "yearqtr" class and then to integers:
s <- c("2007.10", "2007.10", "2007.10", "2007.11", "2007.11", "2007.11",
"2007.11", "2007.12", "2008.01", "2008.01", "2008.01", "2008.01",
"2008.01", "2008.02", "2008.03", "2008.03", "2008.03", "2008.04",
"2008.04", "2008.04")
library(zoo)
yq <- as.yearqtr(s, "%Y.%m")
as.numeric(factor(yq))
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
The last line could alternately be: 4*(yq - yq[1])+1
Note that in the question 2007.12 is classified as in a different quarter than 2007.10 and 2007.11; however, they are all in the same quarter and we assume you did not intend this.
2) Another possibility depending on what you want is:
f <- factor(s)
nlev <- nlevels(f)
levels(f) <- gl(nlev, 3, nlev)
f
## [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3
## Levels: 1 2 3
If there are missing months then this will give a different answer than (1), so it all depends on what you are looking for.
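As for why the question's function returns 4 for "2007.10": it computes (2007 - 2006) + (10 %/% 3) = 1 + 3 = 4, so the year offset and the month bucket are simply added, and different quarters can collide or appear out of order. A base-R sketch of approach (1) without zoo, assuming calendar quarters are the intended bins (the sample vector and the names year, month, qtr_index and bin are illustrative):

```r
s <- c("2007.10", "2007.11", "2007.12", "2008.01", "2008.03", "2008.10", NA)

parts <- strsplit(s, ".", fixed = TRUE)
year  <- as.integer(sapply(parts, `[`, 1))
month <- as.integer(sapply(parts, `[`, 2))

# One distinct integer per calendar quarter, increasing over time
qtr_index <- year * 4L + (month - 1L) %/% 3L

# factor() sorts the quarters that actually occur, so the codes are
# contiguous (1, 2, 3, ...) even when whole quarters are missing
bin <- as.integer(factor(qtr_index))

bin  # 1 1 1 2 2 3 NA
```

Missing values stay NA, and the gap between 2008 Q1 and 2008 Q4 does not leave a gap in the bin numbers.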

Get frequencies (absolute and relative) of levels of a categorical variable from incidence binary data by combination of columns factors

I would like to get the frequencies of each level of a categorical variable (a row vector) denoting the ecological type (3 levels: H, F, T) of a set of 93 herbaceous plants, counting only the species observed as present (=1), conditioned on sites (3 levels: A, B, C), habitats (4 levels: 1, 2, 3, 4) and years (3 levels: 1, 2, 3).
I know the procedure goes through tapply(), but the messy part is the logical operation linking the levels of the categorical variable (H, F, T) to the present species (=1) across all species, conditioned on the combination of factor columns.
This could be summarized by a 12 x 3 contingency table giving the number of species of each ecological type (3) per site (3) and habitat (4).
Example of my data (each habitat contains 20 lines): for each species (Sp1 to Sp93), 0 for absent and 1 for present. The vector "type" contains the ecological type of each species.
Site,Habitat,Year,Sp1,Sp2,Sp3,Sp4,Sp5,Sp6,...,Sp93
type= c(H,H,F,T,F,T,H,....T) # vector of length 93
Thank you in advance.
I hope this would help describe my data objects better.
data = read.csv(file = "Veg_06.csv", header = TRUE)
data = data[1:240, -c(1,4:7)]
Ilot #
Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ... each level has 4 sublevels (from "Site") with 20 lines each, adding up to 80 lines by levels.
Site #
Factor w/ 4 levels "Am","Av","CP","CS": 2 2 2 2 2 2 2 2 2 2 ...
Sp #
int [1:240] 0 0 0 0 0 0 0 0 0 0 ... either "0" or "1" for absence or presence of species.
veg #
Factor w/ 3 levels "H","F","T": 3 3 2 2 3 1 2 1 2 1 ... categorical factor indicating type of species.
First off, I would recommend http://vita.had.co.nz/papers/tidy-data.pdf, Hadley Wickham's paper on Tidy Data, for some ideas on how to organize the data to be better suited to analysis. In essence, we think of each row as a single observation.
It sounds like fundamentally, your data is a collection of year, site, habitat, quadrant(? maybe line, not sure from the description), species with the observation point being that species was observed in that site, habitat, quadrant, and year. For simplicity, a row is present if the species is present.
In addition, there's the concept of type, which is associated with each species.
Analyzing and contingency table
Putting aside the question of how to get your data into this form, let's assume that we have the data in the form described above.
> raw <- expand.grid(species=1:93, quadrant=1:20, habitat=1:4, site=1:3, year=1:3)
> head(raw)
species quadrant habitat site year
1 1 1 1 1 1
2 2 1 1 1 1
3 3 1 1 1 1
4 4 1 1 1 1
5 5 1 1 1 1
6 6 1 1 1 1
And let's take a small sample and a large sample
> set.seed(100); d.small <- raw[sample(nrow(raw),20), ]
> set.seed(100); d.large <- raw[sample(nrow(raw),1000), ]
We can use the ftable function to get this into a state that we want, the 12x4 contingency table, as
> ftable(habitat ~ year + site, data=d.small)
habitat 1 2 3 4
year site
1 1 0 0 1 0
2 0 0 1 1
3 0 1 1 1
2 1 2 1 1 0
2 1 1 0 2
3 0 0 1 0
3 1 2 0 0 1
2 0 1 0 1
3 0 0 0 0
This will count the same species twice if it occurs in two different quadrants of the site/habitat mixture. We can discard the quadrant and unique-ify to get the count across all of them
> ftable(habitat ~ year + site , data=unique(d.small[c('species', 'habitat','year','site')]))
Transforming (tidying the source data)
To transform the data as it stands into a form like this is tricky in vanilla R. With the tidyr package it gets easier (reshape does very similar things as well)
> onerow <- data.frame(year=1, site=1, habitat=2, quadrant=3, sp1=0, sp2=1,sp3=0,sp4=0,sp5=1)
> onerow
year site habitat quadrant sp1 sp2 sp3 sp4 sp5
1 1 1 2 3 0 1 0 0 1
Here I'm making assumptions about what your data look like that seem reasonable (gather comes from the tidyr package, so load it first)
> library(tidyr)
> subset(gather(onerow, species, present, -(year:quadrant)), present==1)
year site habitat quadrant species present
2 1 1 2 3 sp2 1
5 1 1 2 3 sp5 1
> subset(gather(onerow, species, present, -(year:quadrant)), present==1, select=-present)
year site habitat quadrant species
2 1 1 2 3 sp2
5 1 1 2 3 sp5
And now you can proceed with the analysis above.
Merging in the species type data
Looking at your description a little closer, I think you also want to merge in a parallel vector of species type information.
> set.seed(100); sp.type <- data.frame(species=1:93, type=factor(sample(1:4, 93, replace=T)))
> merge(d.small, sp.type)
species quadrant habitat site year type
1 6 16 4 2 3 2
2 27 9 2 2 2 4
3 27 8 4 2 1 4
4 32 18 1 2 2 4
5 33 18 1 1 2 2
6 45 14 4 2 2 3
7 49 6 2 3 1 1
8 54 3 3 2 1 2
9 55 2 1 1 3 3
10 56 2 4 3 1 2
11 56 1 3 1 1 2
12 57 7 2 1 2 1
13 62 18 4 2 2 3
14 70 19 1 1 2 3
15 77 2 3 3 1 4
16 80 7 3 1 2 1
17 81 17 1 1 3 2
18 82 5 2 2 3 3
19 86 9 4 1 3 3
20 87 10 3 3 2 3
And now you can use the subset, unique, and ftable approach above to get the data you need.
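Putting those steps together on a tiny hand-made data set might look like the sketch below. The six observations, the four species and their types are invented for illustration; only the merge, unique and ftable calls come from the answer above.

```r
# One row per observed presence (species seen in a quadrant)
obs <- data.frame(
  species  = c(1L, 1L, 2L, 3L, 3L, 4L),
  quadrant = c(1, 2, 1, 1, 1, 2),
  habitat  = c(1, 1, 1, 2, 2, 2),
  site     = c("A", "A", "A", "A", "B", "B")
)

# Hypothetical lookup table: ecological type of each species
sp.type <- data.frame(
  species = 1:4,
  type    = factor(c("H", "H", "F", "T"), levels = c("H", "F", "T"))
)

# Drop the quadrant so a species counts once per site/habitat,
# then attach the type of each species
typed <- merge(unique(obs[c("species", "habitat", "site")]), sp.type)

# Number of species of each type per site and habitat
tab <- ftable(type ~ site + habitat, data = typed)
tab
```

Species 1 appears in two quadrants of site A, habitat 1, but is counted once, so that cell holds 2 species of type H (species 1 and 2).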
Assuming you had a dataframe with (among other things) the columns named: "sites", "habitats", "years":
dfrm <- data.frame( sites = sample( LETTERS[1:3], 20, replace=TRUE),
habitats= sample( factor(1:4), 20, replace=TRUE),
years = sample( factor(paste("Y",1:4, sep="_")), 20, replace=TRUE) )
Then this will give you an additional factor-mode column that encodes the various levels of each row.
dfrm$three.way.inter <- with(dfrm, interaction(sites, habitats, years))
If you want to keep non-populated levels (combinations that have no instances), do nothing else, since interaction() retains all possible levels by default. If you want to drop them, use drop=TRUE. Then you can analyze these within individual levels of the three classification variables.
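The effect of the drop argument can be seen on a small example. This is a sketch with a made-up three-row data frame and only two factors; all_lv and used_lv are illustrative names.

```r
dfrm <- data.frame(
  sites    = c("A", "A", "B"),
  habitats = factor(c(1, 2, 2), levels = 1:2)
)

# Default (drop = FALSE): every site x habitat combination is a level,
# including B.1, which never occurs in the data
all_lv <- with(dfrm, interaction(sites, habitats))

# drop = TRUE: only the combinations that actually occur are kept
used_lv <- with(dfrm, interaction(sites, habitats, drop = TRUE))

nlevels(all_lv)   # 4
nlevels(used_lv)  # 3
table(all_lv)     # shows a zero count for the unused level B.1
```

Keeping the unused levels is useful when empty cells should appear as zeros in a later table() or tapply() call.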

how to create the frame data structure with columns from csv data in R?

Below are the first five rows of the imported data in R:
data[1:5,]
user event_date day_of_week
1 00002781A2ADA816CDB0D138146BD63323CCDAB2 2010-09-04 Saturday
2 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-04 Saturday
3 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-07 Tuesday
4 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-08 Wednesday
5 00002D2354C7080C0868CB0E18C46157CA9F0FD4 2010-09-17 Friday
distinct_events_a_count total_events_a_count
1 2 2
2 2 2
3 1 3
4 1 1
5 1 1
events_a_duration distinct_events_b_count total_events_b_count
1 615 1 1
2 77 1 1
3 201 1 1
4 44 1 1
5 3 1 1
events_b_duration
1 47
2 43
3 117
4 74
5 18
The problem is that columns 6 and 9 are read as factors and not numerics, so I can't perform math operations on them. To convert the imported data to the appropriate format, I tried to create the structure dataset the following way:
dataset<-data.frame(events_a_duration=as.numeric(c(data[,6])), events_b_duration=as.numeric(c(data[,9])))
but checking the values I noticed that the frame structure doesn't contain the appropriate values:
dataset[1,]
events_a_duration events_b_duration
1 10217 6184
The values should be 615 and 47.
So what I don't know is how to create a data frame that consists of the imported data columns with the right values. I would be very thankful if anyone could show me how to create the appropriate data structure.
Your problem is that you are converting factors to integers, which returns the internal level codes instead of the corresponding values. You can check that the levels are numbered in ascending order of the values:
> as.numeric(factor(c(615,47,42)))
[1] 3 2 1
> as.numeric(factor(c(615,42,47)))
[1] 3 1 2
> as.numeric(factor(c(615,42,47,37)))
[1] 4 2 3 1
> as.numeric(factor(c(615,42,37,47)))
[1] 4 2 1 3
Use as.numeric(as.character(MyFactor)). See below for instance:
> as.numeric(as.character(factor(c(615,42,37,47))))
[1] 615 42 37 47
data <- read.csv ("data.csv", stringsAsFactors=FALSE)
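Applied to a data frame column, the as.numeric(as.character()) conversion might look like this sketch. The small data frame stands in for the imported file (only events_a_duration, with the five values from the question), and the names codes and values are illustrative.

```r
# Stand-in for the imported data: a numeric-looking column stored as a factor
data <- data.frame(
  events_a_duration = factor(c("615", "77", "201", "44", "3"))
)

# Wrong: returns the internal level codes, not the durations
codes <- as.numeric(data$events_a_duration)

# Right: go through the labels first
values <- as.numeric(as.character(data$events_a_duration))

# Equivalent form that converts each level only once
values2 <- as.numeric(levels(data$events_a_duration))[data$events_a_duration]

values  # 615 77 201 44 3
```

If as.numeric() warns about NAs introduced by coercion, the column contains non-numeric entries, which is a common reason for read.csv to treat it as a factor or character column in the first place.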
