R ggplot bar graph has extra lines at the base of columns - r

I have two main issues I could use some help getting resolved.
1.) There are odd lines at the base of my columns which I am not sure how to get rid of.
2.) I am running into overlap with the columns when I graph. (I think this has something to do with position_dodge(width = XXX), but I am not totally sure.)
I have attached an image of an example plot, mainly because I am not sure how to describe what is happening at the base of the plot.
The following code is being used.
where_2 <- where %>%
group_by_("gender", "radio") %>%
summarise(count = n()) %>%
mutate(perc = (perc = (count / sum(count) * 100)))
gg <- ggplot(where_2, aes_string(x = names(where_2[1]), y = where_2$perc, fill = "radio"))
gg <- gg + geom_bar(aes(y = (..count..) / sum(..count..)))
gg <-gg + geom_bar(position = position_dodge(.5),stat = "identity", width = .75)
#gg <- gg + scale_y_continuous(labels = scales::percent)
gg <- gg + xlab(paste0(lab5[2, title]))
gg <- gg + scale_fill_discrete(labels = c("Yes", "No"))
print(gg)
I have been running into a wall for the past 4 days with this question; any help would be appreciated.
place gender Radio
1 Male No
1 Female Yes
1 Male No
1 Female Yes
1 Male Yes
1 Male Yes
1 Female Yes
1 Female Yes
1 Male Yes
1 Female No
1 Male Yes
1 Male Yes
1 Male No
1 Female No
1 Female Yes
1 Female Yes
1 Female No
1 Male Yes
1 Female No
1 Female Yes
1 Female No
1 Female Yes
1 Male No
1 Male No
1 Female No
1 Male No
1 Female No
1 Female No
1 Female No
1 Male Yes
1 Female No
1 Female No
1 Female Yes
1 Male No
1 Male Yes
1 Female No
2 Male Yes
2 Male Yes
2 Female No
2 Female No
2 Male Yes
2 Female No
2 Male No
2 Male Yes
2 Female No
2 Female No
2 Female No
2 Male No
2 Female No
2 Male No
2 Female Yes
2 Female Yes
2 Male Yes
2 Male No
2 Male Yes
3 Female No
3 Male Yes
3 Female No
3 Male No
3 Male Yes
3 Female No
3 Female Yes
3 Male No
3 Male Yes
3 Female Yes
3 Male No
3 Female No
3 Female Yes
3 Female No
3 Female Yes
3 Female No
3 Male Yes
3 Female No
3 Female No
4 Male Yes
4 Female No
4 Female Yes
4 Female Yes
4 Male Yes
4 Female No
4 Female No
4 Male No
4 Female No
4 Female No
4 Female No
4 Male Yes
4 Male Yes
4 Female Yes
4 Female No
4 Male Yes
4 Male Yes
4 Male Yes
4 Female No
4 Female No
4 Female No

Try this:
gg <- ggplot(where2,
aes(x = gender, y = perc, fill = Radio)) +
geom_col(position = "dodge", width = .75)
print(gg)
Explanation below:
You are right that the "feet" are indeed caused by geom_bar(aes(y = (..count..) / sum(..count..))). I'm not sure why you included it in the first place, but here's why it created the "feet":
Good chart
p <- ggplot(where2, aes(x = gender, y = perc, fill = Radio))
p + geom_col(position = position_dodge(0.5), width = 0.75)
Above is the chart you want to get (I assume). geom_col() is equivalent to geom_bar(stat = "identity") with less typing, so I used that instead.
Usually people set the same value in position_dodge() and width =, which would avoid the overlapped look. I've retained it for now to contrast with the "feet" below.
Notice also the values on the y-axis. They range from 0 to 60+.
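As a minimal sketch of the matching-widths remark above: assuming ggplot2 is installed, and retyping the four rounded percentages from the where2 table in this answer, setting position_dodge() and width to the same value should place the two bars of each gender side by side without overlap. The name p_matched is only illustrative.

```r
library(ggplot2)

# Rounded percentages retyped from the where2 summary table in this answer.
where2_demo <- data.frame(
  gender = c("Female", "Female", "Male", "Male"),
  Radio  = c("No", "Yes", "No", "Yes"),
  perc   = c(67.3, 32.7, 37.5, 62.5)
)

# Same value for the dodge width and the bar width: bars touch but do not overlap.
p_matched <- ggplot(where2_demo, aes(x = gender, y = perc, fill = Radio)) +
  geom_col(position = position_dodge(width = 0.75), width = 0.75)

print(p_matched)
```

With a dodge width smaller than the bar width (0.5 against 0.75, as in the chart above), the bars are shifted by less than their own width, which is what produces the overlapped look.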
Bad chart
p + geom_bar(aes(y = (..count..) / sum(..count..)))
Above is the chart of the "feet", now occupying the entire plot's height. Here, ..count.. returns the number of rows for each combination of gender & Radio, while sum(..count..) returns the total number of rows in the data frame. The data frame, where2, has 4 rows, one for each combination, so the y value associated with each bar is 0.25, and the stacked height of each gender's two bars is 0.5.
I consider this the bad chart, because the visualisation is useless. When you have already counted the number of rows in your dataset yourself (going from where to where2), it's not necessary for ggplot to do it again.
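The 0.25 and 0.5 figures can be reproduced by hand in base R. This is only a sketch of the arithmetic, not of what ggplot2 does internally; the percentages are retyped from the where2 table and the names row_counts and stacked are illustrative.

```r
# One row per gender/Radio combination, as in where2.
where2_demo <- data.frame(
  gender = c("Female", "Female", "Male", "Male"),
  Radio  = c("No", "Yes", "No", "Yes"),
  perc   = c(67.3, 32.7, 37.5, 62.5)
)

# Number of rows in each combination: this is what ..count.. sees
# when the already-summarised table is passed to geom_bar().
row_counts <- as.data.frame(
  table(gender = where2_demo$gender, Radio = where2_demo$Radio),
  responseName = "count"
)

# Each combination has 1 row out of 4, so every bar gets y = 0.25.
row_counts$share <- row_counts$count / sum(row_counts$count)

# geom_bar() stacks by default, so each gender reaches 0.25 + 0.25 = 0.5.
stacked <- tapply(row_counts$share, row_counts$gender, sum)

print(row_counts)
print(stacked)
```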
Good chart + bad chart = weird chart
p +
geom_col(position = position_dodge(0.5), width = 0.75) +
geom_bar(aes(y = (..count..) / sum(..count..)))
Above is the combined chart with both layers. Now the bad chart's bars are squeezed all the way to the bottom, since their combined height is only 0.5, while the good chart's bars stretch all the way to 60+.
data used:
> dput(where)
structure(list(place = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L,
4L, 4L, 4L, 4L, 4L, 4L), gender = structure(c(2L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 1L, 1L,
1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 1L,
2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L,
1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L,
1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Female",
"Male"), class = "factor"), Radio = structure(c(1L, 2L, 1L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 2L,
1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L,
2L, 2L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L,
2L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 2L,
1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("No",
"Yes"), class = "factor")), .Names = c("place", "gender", "Radio"
), class = "data.frame", row.names = c(NA, -95L))
where2 <- where %>%
group_by(gender, Radio) %>%
summarise(count = n()) %>%
mutate(perc = count / sum(count) * 100)
> where2
# A tibble: 4 x 4
# Groups: gender [2]
gender Radio count perc
<fctr> <fctr> <int> <dbl>
1 Female No 37 67.3
2 Female Yes 18 32.7
3 Male No 15 37.5
4 Male Yes 25 62.5

Related

How to plot multiple ACF values on the one graph

I've just started using R and would like to look at the autocorrelation in my data using ACF. My dataframe (GL) looks something like this:
GL
well year month value area
684 1994 Jan 8.53 H
684 1994 Feb 8.62 H
684 1994 Mar 8.12 H
684 1994 Apr 8.21 H
684 1995 Jan 8.53 H
684 1995 Feb 8.62 H
684 1995 Mar 8.12 H
684 1995 Apr 8.21 H
684 1996 Jan 8.53 H
684 1996 Feb 8.62 H
684 1996 Mar 8.12 H
684 1996 Apr 8.21 H
101 1994 Jan 8.53 R
101 1994 Feb 8.62 R
101 1994 Mar 8.12 R
101 1994 Apr 8.21 R
101 1995 Jan 8.53 R
101 1995 Feb 8.62 R
101 1995 Mar 8.12 R
101 1995 Apr 8.21 R
101 1996 Jan 8.53 R
101 1996 Feb 8.62 R
101 1996 Mar 8.12 R
101 1996 Apr 8.21 R
I would like to:
1. Calculate ACF for each well using lapply or some kind of loop (my actual data set has about 100 wells and three groups).
2. Plot the ACF values (as lines) for each well on one graph for each group (so in this case I would have two ACF graphs, H and R).
I can use split and lapply to calculate ACF for each well e.g.
split <- split(GL$value,GL$well)
test <- lapply(split,acf)
But splitting this way doesn't save the area information. If I split like this:
split1 <- split(GL,GL$well)
Then I don't know how to perform lapply on the values for each well.
As you split the data by well,
spl1 <- split(GL, GL$well)
the lapply would look like this.
lapply(spl1, function(x) acf(x$value))
We could make this somewhat nicer, though.
When we run lapply over the list indices instead of the list itself, we get a "counter" with which we can access the list names and paste together informative titles. With par(mfrow=c(<rows>, <columns>)) we can set the arrangement of the plots.
par(mfrow=c(1, 2))
lapply(seq_along(spl1), function(x) acf(spl1[[x]]$value,
main=paste0("well ", names(spl1)[x], ", ",
"area ", unique(spl1[[x]]$area))))
Result
This will probably have to be adapted according to how your wells are divided into groups.
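One possible adaptation, sketched in base R under stated assumptions: the data frame GL_demo below is made up (two wells per area, 12 readings each, numeric value column), and acf_by_area is a hypothetical helper, not part of any package. It computes the ACF with plot = FALSE and then draws one line per well, one panel per area, with matplot().

```r
# Hypothetical data: two wells per area, 12 readings per well.
set.seed(1)
GL_demo <- data.frame(
  well  = rep(c("684", "685", "101", "102"), each = 12),
  area  = rep(c("H", "R"), each = 24),
  value = rnorm(48, mean = 8.4, sd = 0.2)
)

max_lag <- 6

# Hypothetical helper: ACF values for every well of one area,
# returned as a matrix with one row per lag and one column per well.
acf_by_area <- function(d, lag.max = max_lag) {
  sapply(split(d$value, d$well, drop = TRUE), function(v) {
    as.numeric(acf(v, lag.max = lag.max, plot = FALSE)$acf)
  })
}

acf_list <- lapply(split(GL_demo, GL_demo$area), acf_by_area)

# One panel per area, one line per well.
par(mfrow = c(1, length(acf_list)))
for (a in names(acf_list)) {
  m <- acf_list[[a]]
  matplot(0:max_lag, m, type = "l", lty = 1,
          xlab = "Lag", ylab = "ACF", main = paste("area", a))
  legend("topright", legend = colnames(m), col = seq_len(ncol(m)), lty = 1)
}
```

Fixing lag.max keeps every column the same length, which sapply() needs in order to return a matrix. If value is stored as a factor, as in the dput() output below, it would first have to be converted with as.numeric(as.character(value)).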
(As a sidenote: Better avoid overwriting function names. You use split() and give the result the same name as the function which could induce confusion, both of yourself and of R. Other popular candidates are data, df, table. We can always quickly check with ? whether the name is "free", e.g. ?df.)
Data
# result of `dput(GL)`
GL <- structure(list(well = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("101", "684"), class = "factor"), year = structure(c(1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L,
2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1996"
), class = "factor"), month = structure(c(3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L), .Label = c("Apr", "Feb", "Jan", "Mar"), class = "factor"),
value = structure(c(3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L,
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L), .Label = c("8.12",
"8.21", "8.53", "8.62"), class = "factor"), area = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("H", "R"), class = "factor")), row.names = c(NA,
-24L), class = "data.frame")
You can solve it with data.table:
Let's start with the data (slightly modified from yours, so there will be different values for each well):
library(data.table)
GL <- structure(list(well = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L), .Label = c("101", "684"), class = "factor"), year = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L), .Label = c("1994", "1995", "1996"), class = "factor"), month = structure(c(3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("Apr", "Feb", "Jan", "Mar"), class = "factor"),
value = c(4.65144120692275, 8.98342372477055, 17.983893298544,
15.3687085728161, 8.9577708535362, 7.47583840973675, 16.6564453896135, 11.6158618542831, 23.6109819535632, 14.1604918171652, 11.3882310683839, 20.4579487598967, 3.31275907787494, 22.109053656226, 13.598402187461, 12.3686389743816, 17.9585587936454, 17.3689122993965, 7.38424337399192, 6.93579732463695, 13.2789171519689, 21.2500206897967, 13.5766511948314, 3.58588649751619), area = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("H", "R"), class = "factor")), row.names = c(NA, -24L), class = c("data.table", "data.frame"))
Then we create a list for each well:
GL[, datos := .(list(value)) , by = well]
Each row in the datos variable will have a list with all the values corresponding to the well, so we can drop most of them and keep only the first row of each well, as it has all the information already. That is done with GL[, .SD[1,], by = well] so the result will be a two-row data table. After that, we can chain another expression that will produce and save each plot:
GL[, .SD[1,], by = well][
, {png(filename = paste0(well, "-", area, ".png"),
width = 1600,
height = 1600,
units = "px",
res = 100);
acf(datos[[1]], main = paste("Well:", well,
"Area:", area, sep = " "));
dev.off()},
by = well]
Your two plots will be saved in the current directory with names like "684-H.png" and "101-R.png".
Key point here: data.table takes expressions and not just functions, so it's absolutely possible to produce the plots and save them to any given location.

hist 2 variable with different x value

I have a data frame 'TOT', with the value of abundance of a species and values of temperature, something like this:
AB_specie : int 2 1 11 1 6 2 1 8 1 3
TEMP : num 24.8 21 24.8 25.6 24.8 ..
structure(list(AB_specie = c(2L, 1L, 11L, 1L, 6L, 2L, 1L, 8L, 1L, 3L, 1L, 1L, 2L, 1L, 65L, 1L, 5L, 2L, 2L, 1L, 2L, 2L, 16L,1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 5L, 1L, 1L, 2L, 1L, 1L, 1L,2L, 2L, 4L, 2L, 1L, 1L, 2L, 1L, 1L, 1L, 4L), TEMP = c(24.8332,20.9689, 24.7757, 25.648, 24.7579, 25.2056, 25.6153,23.1105,25.0967, 25.0202, 23.4963, 24.9033, 25.0009, 25.1767, 24.6222,26.5870991, 24.9885006,26.8768997, 25.2455006, 25.8085995, 26.8239002,26.6907997, 27.2084007, 27.2700005, 27.1746998, 27.2026005, 25.3586006,25.4300003, 24.9193001, 27.0338001, 23.0319004, 26.4820995, 25.2614994,27.2424, 27.664, 27.4767, 27.4602, 26.3897, 24.5804, 26.9536,26.3928, 26.1778, 27.1487, 26.3726, 26.6156, 26.5343, 23.7879,24.6767, 26.834, 26.9746)), class = "data.frame")
I want to plot a histogram of the temperature frequencies and see how the frequency of the species' abundance overlaps with it, in order to understand at which temperatures the species is most abundant.
I did this :
hist(TOT$AB_specie, breaks=30, xlim=c(0,15), col=rgb(1,0,0,0.5), xlab="",
ylab="FREQUENCY", main="" )
hist(TOT$TEMP, breaks=30, xlim=c(0,15), col=rgb(0,0,1,0.5), add=T)
legend("topright", legend=c("AB_specie","TEMP"), col=c(rgb(1,0,0,0.5),
rgb(0,0,1,0.5)), pt.cex=2, pch=15 )
But obviously, having two different values of x, I get something like this
Can you help me understand how I can graph this information? Excuse me, I'm a beginner.
And this is what I have with scatterplot:
[two images not shown]

Large data set cleaning: How to fill in missing data based on multiple categories and searching by row order

This is my first StackOverflow post, so I hope that it isn't too difficult to understand.
I have a large dataset (~14,000 rows) of bird observations. These data were collected by standing in one place (point) and counting birds that you see within 3 minutes. Within each point count, a new bird observation becomes a new row, so that there are many repeated dates, times, sites, and points (specific locations within a site). Again, each point count is 3 minutes long. So if you see a yellow warbler (coded as YEWA) during minute 1, then it will be associated with MINUTE=1 for that specific point count (date, site, point, and time). ID=observer initials and NUMBER=number of birds spotted (not necessarily important here).
However, if NO BIRDS are seen, then a "NOBI" goes into the dataset for that specific minute. Thus, if there are NOBI for an entire 3-minute point count, there will be three rows with the same date, site, point, and time, and "NOBI" in the "BIRD" column for each of the three rows.
So I have TWO main problems. The first is that some observers entered "NOBI" once for all three minutes, instead of three times (once per minute). Anywhere "MINUTE" has been left blank (becoming NA) AND "BIRD"="NOBI", I need to add three rows of data, all with the same values for all columns except "MINUTE", which should be 1, 2, and 3 for the respective rows.
If it looks like this:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
1 BS 5/9/2018 CW2 U125 7:51 NA NOBI NA
2 BS 5/9/2018 CW1 D250 8:12 1 YEWA 2
3 BS 5/9/2018 CW1 D250 8:12 2 NOBI NA
4 BS 5/9/2018 CW1 D250 8:12 3 LABU 1
It should look like this instead:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
1 BS 5/9/2018 CW2 U125 7:51 1 NOBI NA
2 BS 5/9/2018 CW2 U125 7:51 2 NOBI NA
3 BS 5/9/2018 CW2 U125 7:51 3 NOBI NA
4 BS 5/9/2018 CW1 D250 8:12 1 YEWA 2
5 BS 5/9/2018 CW1 D250 8:12 2 NOBI NA
6 BS 5/9/2018 CW1 D250 8:12 3 LABU 1
note: If you are wanting to enter some of this data into your R console, I included some at the end of this post using dput, which should be easier to enter than copy-and-pasting the above
I have made failed attempts at reproducing if statements with multiple conditions (based on "R multiple conditions in if statement" and "Ifelse in R with multiple categorical conditions"). I tried writing this many ways, including with piping from dplyr, but see below for one example of some code, notes, and error messages.
>if(PC$BIRD == "NOBI" & PC$MINUTE==NA){PC$Fix<-TRUE}
Error in if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
missing value where TRUE/FALSE needed
In addition: Warning message:
In if (PC$BIRD == "NOBI" & PC$MINUTE == NA) { :
the condition has length > 1 and only the first element will be used
## Then I need to do something like this:
>if(PC$Fix<-TRUE){duplicate(row where Fix==TRUE, times=2)} #I know this isn't
### even close, but I want the row to be replicated two more times so
### that there are 3 total rows with the same values
### Fix indicates that a fix is needed in this example
# Then somehow I need to assign a 1 to PC$MINUTE for the first row (original row),
# a 2 to the next row (with other values from other columns being the same), and a 3
# to the last duplicated row (still other values from other columns being the same)
The second problem, which seems more difficult to me is to search the dataset in order or perhaps by DATE,SITE,POINT, and TIME in some way. The minute values should always go from 1... to 2... to 3, and then back to 1 for the next set of date, time, site, and point. That is, each point count should have all values 1:3. However, one count may have multiple sightings in MINUTE=1 so that there are 5 or 6 (or 20) MINUTE=1 before MINUTE=2. BUT, again, some observers in this dataset simply left a row out when there was NO BIRDS (NOBI), instead of writing a row with BIRD="NOBI" for each MINUTE. That is if the dataset goes:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4 BS 5/9/2018 CW2 U125 7:54 1 AMRO 1
5 BS 5/9/2018 CW2 U125 7:54 1 SPTO 1
6 BS 5/9/2018 CW2 U125 7:57 1 AMRO 1
7 BS 5/9/2018 CW2 U125 7:57 1 SPTO 1
8 BS 5/9/2018 CW2 U125 7:57 1 AMCR 3
9 BS 5/9/2018 CW2 U125 7:57 2 SPTO 1
10 BS 5/9/2018 CW2 U125 7:57 2 HOWR 1
11 BS 5/9/2018 CW2 U125 7:57 3 UNBI 1
We can see that the 7:57 point count time is complete (there are MINUTE values of 1:3). However, the 7:54 point count time stops at MINUTE=1. Meaning, I need to enter two more rows underneath that have all of the same DATE,SITE,POINT,TIME information, but with MINUTE=2 and BIRD="NOBI" for the first added row and MINUTE=3 and BIRD="NOBI" for the second added row. So it should look like this:
ID DATE SITE POINT TIME MINUTE BIRD NUMBER
...
4 BS 5/9/2018 CW2 U125 7:54 1 AMRO 1
5 BS 5/9/2018 CW2 U125 7:54 1 SPTO 1
6 BS 5/9/2018 CW2 U125 7:54 2 NOBI NA
7 BS 5/9/2018 CW2 U125 7:54 3 NOBI NA
8 BS 5/9/2018 CW2 U125 7:57 1 AMRO 1
9 BS 5/9/2018 CW2 U125 7:57 1 SPTO 1
10 BS 5/9/2018 CW2 U125 7:57 1 AMCR 3
11 BS 5/9/2018 CW2 U125 7:57 2 SPTO 1
12 BS 5/9/2018 CW2 U125 7:57 2 HOWR 1
13 BS 5/9/2018 CW2 U125 7:57 3 UNBI 1
Lastly, I understand that this is a long and complicated question, and I may not have articulated it very well. Please let me know if there is any clarification needed, and I would be happy to hear any advice, even if it doesn't fully answer my problems. Thank you in advance!
Everything below this line is only useful for you if you want to enter a sample of my data into R
To enter my data into R console, copy and paste everything from "structure" function to end of code to enter it as dataframe in R console with code: dataframe<-structure(list...
See Example of using dput() for help.
PC<-read.csv("PC.csv") ### ORIGINAL FILE
dput(PC)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "BS", class = "factor"),
DATE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"),
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "CW2", class = "factor"),
POINT = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M", "U125"), class = "factor"),
TIME = structure(c(8L, 8L, 8L, 9L, 9L, 10L, 10L, 10L, 10L,
10L, 10L, 11L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
4L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 7L), .Label = c("6:48", "6:51",
"6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54", "7:57",
"8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 1L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 3L, 3L, NA, NA), BIRD = structure(c(6L,
6L, 6L, 2L, 7L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L, 6L, 6L,
6L, 6L, 7L, 7L, 7L, 7L, 6L, 8L, 3L, 7L, 9L, 5L, 4L, 2L, 6L,
6L), .Label = c("AMCR", "AMRO", "BRSP", "DUFL", "HOWR", "NOBI",
"SPTO", "UNBI", "VESP"), class = "factor"), NUMBER = c(NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA,
NA, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA,
NA)), class = "data.frame", row.names = c(NA, -32L))
PCc<-read.csv("PC_Corrected.csv") #### WHAT I NEED MY DATABASE TO LOOK LIKE
dput(PCc)
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "BS", class = "factor"), DATE = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "5/9/2018", class = "factor"),
SITE = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L), .Label = "CW2", class = "factor"), POINT = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("M",
"U125"), class = "factor"), TIME = structure(c(8L, 8L, 8L,
9L, 9L, 9L, 9L, 10L, 10L, 10L, 10L, 10L, 10L, 11L, 11L, 11L,
1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 5L,
5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 7L, 7L, 7L), .Label = c("6:48",
"6:51", "6:54", "6:57", "7:12", "7:15", "7:18", "7:51", "7:54",
"7:57", "8:00"), class = "factor"), MINUTE = c(1L, 2L, 3L,
1L, 1L, 2L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 2L, 3L, 1L, 1L, 1L,
2L, 3L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), BIRD = structure(c(6L,
6L, 6L, 2L, 7L, 6L, 6L, 2L, 7L, 1L, 7L, 5L, 8L, 8L, 6L, 6L,
6L, 6L, 6L, 6L, 6L, 6L, 7L, 7L, 6L, 6L, 7L, 7L, 6L, 8L, 3L,
7L, 9L, 5L, 4L, 2L, 6L, 6L, 6L, 6L, 6L, 6L), .Label = c("AMCR",
"AMRO", "BRSP", "DUFL", "HOWR", "NOBI", "SPTO", "UNBI", "VESP"
), class = "factor"), NUMBER = c(NA, NA, NA, 1L, 1L, NA,
NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, 1L, 1L, NA, NA, 1L, 1L, NA, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
NA, NA, NA, NA, NA, NA)), class = "data.frame", row.names = c(NA,
-42L))
Here's a way to do it using dplyr and tidyr from the tidyverse meta-package.
# Step one - identify missing rows.
# For each DATE, SITE, POINT, TIME, count how many of each minute
library(tidyverse)
# Convert factors to character to make later joining simpler,
# and fix missing ID's by assuming prior line should be used,
# and make NOBI rows have a count of NA
PC_2_clean <- PC %>%
mutate_if(is.factor, as.character) %>%
fill(ID, .direction = "up") %>%
mutate(NUMBER = if_else(BIRD == "NOBI", NA_integer_, NUMBER))
# Create a wide table with spots for each minute. Missing will
# show up as NA's
# All the NA's here in the 1, 2, and 3 columns represent
# missing minutes that we should add.
PC_3_NA_find <- PC_2_clean %>%
count(ID, DATE, SITE, POINT, TIME, MINUTE) %>%
spread(MINUTE, n)
PC_3_NA_find
# A tibble: 11 x 9
# ID DATE SITE POINT TIME `1` `2` `3` `<NA>`
# <chr> <chr> <chr> <chr> <chr> <int> <int> <int> <int>
# 1 BS 5/9/2018 CW2 M 7:12 3 1 2 NA
# 2 BS 5/9/2018 CW2 M 7:15 NA NA NA 1
# 3 BS 5/9/2018 CW2 M 7:18 NA NA NA 1
# 4 BS 5/9/2018 CW2 U125 6:48 1 1 1 NA
# 5 BS 5/9/2018 CW2 U125 6:51 1 1 1 NA
# 6 BS 5/9/2018 CW2 U125 6:54 2 NA NA NA
# 7 BS 5/9/2018 CW2 U125 6:57 2 1 1 NA
# 8 BS 5/9/2018 CW2 U125 7:51 1 1 1 NA
# 9 BS 5/9/2018 CW2 U125 7:54 2 NA NA NA
# 10 BS 5/9/2018 CW2 U125 7:57 3 2 1 NA
# 11 BS 5/9/2018 CW2 U125 8:00 1 NA NA NA
# Take the NA minute entries and make the desired line for each
PC_4_rows_to_add <- PC_3_NA_find %>%
gather(MINUTE, count, `1`:`3`) %>%
filter(is.na(count)) %>%
select(-count, -`<NA>`) %>%
mutate(MINUTE = as.integer(MINUTE),
BIRD = "NOBI",
NUMBER = NA_integer_)
# Add these lines to the original, remove the NA minute rows
# (these have been replaced with minute rows), and sort
PC_5_with_NOBIs <- PC_2_clean %>%
bind_rows(PC_4_rows_to_add) %>%
filter(!is.na(MINUTE)) %>%
arrange(ID, DATE, SITE, POINT, TIME, MINUTE, BIRD)
# Check result
PC_5_with_NOBIs %>%
count(ID, DATE, SITE, POINT, TIME, MINUTE) %>%
spread(MINUTE, n)
PC_5_with_NOBIs
# Now to confirm it matches your desired output.
# Note, I convert to character to avoid mismatches between factors
PCc_char <- PCc %>%
mutate_if(is.factor, as.character) %>%
arrange(ID, DATE, SITE, POINT, TIME, MINUTE, BIRD)
identical(PC_5_with_NOBIs, PCc_char)
# [1] TRUE
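For the first problem on its own, a base-R sketch may help show why the attempt in the question fails: MINUTE == NA always evaluates to NA, so the test has to be written with is.na(), and if() only accepts a single value, so the row-wise condition is kept as a logical vector instead. The data frame PC_demo is a hypothetical four-row sample in the shape of the question's data, not the real file.

```r
# Hypothetical sample: one NOBI row with a missing MINUTE, then a complete count.
PC_demo <- data.frame(
  TIME   = c("7:51", "8:12", "8:12", "8:12"),
  MINUTE = c(NA, 1L, 2L, 3L),
  BIRD   = c("NOBI", "YEWA", "NOBI", "LABU"),
  NUMBER = c(NA, 2L, NA, 1L),
  stringsAsFactors = FALSE
)

# Vectorised test: is.na() instead of `== NA`, single & instead of if().
needs_fix <- PC_demo$BIRD == "NOBI" & is.na(PC_demo$MINUTE)

# Repeat flagged rows three times and every other row once.
times <- ifelse(needs_fix, 3L, 1L)
idx <- rep(seq_len(nrow(PC_demo)), times = times)
PC_fixed <- PC_demo[idx, ]

# Number the repeated rows 1, 2, 3 within each expanded point count.
PC_fixed$MINUTE[needs_fix[idx]] <- sequence(times[needs_fix])
rownames(PC_fixed) <- NULL

print(PC_fixed)
```

This only covers rows where MINUTE is NA; counts that stop at MINUTE=1 still need the wide-table approach shown above.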

Check Dataframe by Row, over multiple columns and Code 1 for positive and 0 for negative

I have a set of columns, all coded as factors. The values are coded as 1 for positive and 0 for negative. Samples on rows, and scores for each on the columns.
I want to find out, sample wise, if there are any positives. If there is at least one positive, I want to generate a new column in the same database which says 1, as in this sample was positive for at least one, or 0 as in this sample was negative for all.
dat3 <- structure(list(A = structure(c(2L, 2L, 2L, 2L, 2L, 1L, 2L, 2L,2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 1L), .Label = c("1", "0"), class = "factor"),
B = structure(c(1L,1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,1L, 1L, 1L, 1L, 2L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
C = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 2L), .Label = c("nd","0", "1"), class = "factor"),
D = structure(c(1L, 1L, 1L,2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L,2L, 1L, 1L, 1L, 1L, 1L), .Label = c("0", "1"), class = "factor"),
E = structure(c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("0","1"), class = "factor")),
.Names = c("A", "B", "C", "D", "E"), class = "data.frame", row.names = c(NA, 24L))
I achieved the result I wanted by using if and if else statements, but they are really tedious and I don't think that's the best way to do it. I've been trying the apply function, but I haven't had much success.
The result I'm expecting is
dat3$result <- c(0,0,1,1,1,1,0,1,0,1,0,1,1,0,1,0,1,1,1,0,0,1,0,1)
The columns of 'dat3' are all factors; convert them to numeric and then use rowSums to create a binary column:
dat3$result <- as.integer(rowSums(sapply(dat3, function(x)
as.integer(as.character(x))), na.rm = TRUE) > 0)
Or convert to a logical matrix and then do the rowSums
as.integer(rowSums(dat3 == "1")> 0)
#[1] 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 0 1 1 1 0 0 1 0 1
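A small sketch of the logical-matrix approach on made-up data, including a non-numeric "nd" entry like the one in column C of dat3. The frame toy and the vector score_cols are hypothetical names; restricting the comparison to score_cols keeps the check independent of the result column once it has been added.

```r
# Hypothetical frame in the same shape as dat3: factor columns of "0"/"1",
# with one "nd" entry that should not count as a positive.
toy <- data.frame(
  A = factor(c("0", "1", "0", "0")),
  B = factor(c("0", "0", "nd", "0")),
  C = factor(c("1", "0", "0", "0"))
)

score_cols <- c("A", "B", "C")

# Comparing against "1" yields a logical matrix; "nd" simply compares FALSE,
# so no conversion warnings and no NA handling are needed.
toy$result <- as.integer(rowSums(toy[score_cols] == "1") > 0)

print(toy)
```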

Count the number of unique character elements in one column based on several different (sub-)groupings (columns)

Here is a sample dataset.
test_data <- structure(list(ID = structure(c(4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L,
3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L), .Label = c("P39190",
"U93491", "X28348", "Z93930"), class = "factor"), Sex = structure(c(2L,
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L), .Label = c("F", "M"), class = "factor"), Group = structure(c(2L,
2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 3L,
3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("C83Z", "CAP_1", "P000"), class = "factor"),
Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L,
2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L), .Label = c("A",
"B", "C"), class = "factor")), .Names = c("ID", "Sex", "Group",
"Category"), class = "data.frame", row.names = c(NA, -36L))
head(test_data, n = 10)
ID Sex Group Category
1 Z93930 M CAP_1 A
2 Z93930 M CAP_1 A
3 Z93930 M C83Z A
4 Z93930 M C83Z A
5 Z93930 M C83Z A
6 Z93930 M C83Z A
7 X28348 F C83Z B
8 X28348 F C83Z B
9 X28348 F CAP_1 B
10 X28348 F CAP_1 B
I want to count the number of unique elements in three levels:
Count of unique elements per "Category"
Count of unique elements in each "Category" grouped by "Group"
Count of unique elements in each "Group" grouped by "Sex"
I can of course use base R and a bit of dplyr to achieve this:
library(dplyr)
for(i in 1:length(unique(test_data$Category))){
temp <- test_data %>% dplyr::filter(Category == unique(test_data$Category)[i])
message(paste0(unique(test_data$Category)[i]), ": ", length(unique(temp$ID)))
for(k in 1:length(unique(temp$Group))){
temp_grp <- temp %>% dplyr::filter(Group == unique(temp$Group)[k])
message(paste0("\n ├──", unique(temp$Group)[k],
": ", length(unique(temp_grp$ID))))
message(paste0("\n\t"), "F: ", length(unique(temp_grp[which(temp_grp$Sex == "F"),])$ID))
message(paste0("\n\t"), "M: ", length(unique(temp_grp[which(temp_grp$Sex == "M"),])$ID))
}
}
But this is too dirty and unclever.
Is there a function in R that can achieve this in a cleaner and more efficient manner and preferably produce the output in the form of a dataframe?
I was under the impression that dplyr::group_by was made for such tasks. But I cannot quite figure out how it works for sub-groupings.
The code below:
test_data %>% dplyr::group_by(Category) %>% summarise(n = n_distinct(ID))
achieves the first task (point 1. above). But I cannot achieve points 2 and 3 in the same way.
SOLUTION:
test_data %>% dplyr::group_by(Category, Group, Sex) %>% summarise(n = n_distinct(ID))
If I understand your question correctly, you were not very far from it at all. The idea is just to group by two columns at a time this way: group_by(col1, col2).
For point 2:
test_data %>% dplyr::group_by(Category, Group) %>% summarise(n = n_distinct(ID))
Source: local data frame [9 x 3]
Groups: Category [?]
Category Group n
<fctr> <fctr> <int>
1 A C83Z 1
2 A CAP_1 1
3 A P000 2
4 B C83Z 1
5 B CAP_1 1
6 B P000 1
7 C C83Z 1
8 C CAP_1 1
9 C P000 2
And for point 3:
test_data %>% dplyr::group_by(Group, Sex) %>% summarise(n = n_distinct(ID))
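As a cross-check without dplyr, the same distinct-ID counts can be sketched in base R with aggregate(). This is an alternative technique, not what the answer above uses; the data frame mini and the helper distinct_ids are hypothetical.

```r
# Hypothetical miniature of test_data: ID "Z1" appears on several rows.
mini <- data.frame(
  ID       = c("Z1", "Z1", "Z1", "X2", "X2", "U3"),
  Category = c("A", "A", "A", "B", "B", "A"),
  Group    = c("CAP_1", "C83Z", "C83Z", "C83Z", "CAP_1", "C83Z")
)

# Number of distinct values, the base-R counterpart of n_distinct().
distinct_ids <- function(x) length(unique(x))

# Distinct IDs per Category, and per Category within Group.
per_category  <- aggregate(ID ~ Category, data = mini, FUN = distinct_ids)
per_cat_group <- aggregate(ID ~ Category + Group, data = mini, FUN = distinct_ids)

print(per_category)
print(per_cat_group)
```

Category A has four rows here but only two distinct IDs, which is the difference between counting rows and counting unique elements.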
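If I understand correctly, you can use dplyr::count for all three cases, provided you first reduce the data to distinct ID combinations (count() on its own tallies rows, not unique IDs):
test_data %>% dplyr::distinct(Category, ID) %>% dplyr::count(Category)
test_data %>% dplyr::distinct(Group, Category, ID) %>% dplyr::count(Group, Category)
test_data %>% dplyr::distinct(Sex, Group, ID) %>% dplyr::count(Sex, Group)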
