find same group in R [duplicate] - r

This question already has answers here:
Counting the number of elements with the values of x in a vector
(20 answers)
Closed 1 year ago.
I have data table like this
year value
2010 25
2011 168
2012 48
2010 189
2011 192
2012 38
2010 175
2011 55
2012 48
I want to distinguish my data to be like this
year value
2010 3 (25 189 175: = 33.33% )
2011 3 (168 192 55 : = 33.33%)
2012 3 (48 38 48: = 33.33%)
so that I can then plot a bar graph with 3 main groups (2010, 2011, 2012) on the x-axis and the % of members in each year on the y-axis.
What should I do? I'm a beginner in R. Thank you in advance :D

Do you want this
> data.frame(xtabs(~year, df))
year Freq
1 2010 3
2 2011 3
3 2012 3
or the plot
barplot(prop.table(xtabs(~year, df)))

You should note that for plotting the graph, the transformation you describe is not necessary. But you can do
dplyr::count(your_data, year)
returning
# A tibble: 3 x 2
year n
<dbl> <int>
1 2010 3
2 2011 3
3 2012 3
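A minimal base R sketch that produces both the count and the percentage per year in one data frame, using the data from the question. The name summary_df is only illustrative and not from the original posts.

```r
# Data from the question
df <- data.frame(
  year  = c(2010, 2011, 2012, 2010, 2011, 2012, 2010, 2011, 2012),
  value = c(25, 168, 48, 189, 192, 38, 175, 55, 48)
)

# Number of rows per year
counts <- table(df$year)

# Counts and share of all rows (in %) side by side
summary_df <- data.frame(
  year    = names(counts),
  n       = as.vector(counts),
  percent = round(100 * as.vector(prop.table(counts)), 2)
)
print(summary_df)

# Bar graph: years on the x-axis, % of rows in each year on the y-axis
barplot(100 * prop.table(counts), xlab = "year", ylab = "% of rows")
```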


Subset rows in one data frame based on values from another data frame [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
merging on multiple columns R
(1 answer)
How do I combine two data-frames based on two columns? [duplicate]
(3 answers)
Closed 2 months ago.
I have two data frames: df1 and df2. They both have four columns: three with the same names (ID, Year and Week) and one that differs between them.
>df1
ID Year Oxygen Week
---- ------ ------- -------
1 2004 18 1
1 2005 17 1
2 2006 17 1
2 2007 18 1
3 2008 19 1
3 2010 20 1
3 2010 20 1
4 2012 16 1
5 2013 18 1
6 2014 18 1
>df2
ID Year Kg Week
---- ------ ----- -------
1 2004 20 1
1 2005 35 2
2 2006 30 2
3 2007 15 1
3 2008 70 2
4 2009 40 1
5 2013 55 1
6 2012 40 1
6 2014 10 2
7 2013 15 1
I would like to produce a new data frame which contains the rows from df1 only when the combination of ID and Year in df1 is also present in df2. The Week might be the same or not for that row, but I don't want to take the column Week into account. So the first row in df1 has 1 for ID and 2004 for Year, which also occurs in df2. The combination of ID and Year for the second row in df1 also occurs in df2 but has a different value for Week.
I know how to do it if it only depends on one column:
df3 <- subset(df1, ID %in% df2$ID)
There was a solution for this when I didn't have the column Week which was:
df3 <- df1 %>% inner_join(df2)
But I don't know how to make it depend on both the ID and Year at the same time without also taking Week into account.
I should end up with the following data frame, which only contains the columns from df1:
>df3
ID Year Oxygen Week
---- ------ ------- -------
1 2004 18 1
1 2005 17 1
2 2006 17 1
3 2008 19 1
4 2012 16 1
5 2013 18 1
6 2014 18 1
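tidyverse approach
library(dplyr)
# semi_join() keeps the rows of df1 that have a match in df2 on ID and Year;
# Week is ignored and no columns from df2 are added
df3 <-
df1 %>%
semi_join(df2, by = c("ID", "Year"))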
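A base R sketch of filtering df1 on the ID/Year combination only, ignoring Week. The data frames are rebuilt from the tables in the question; key1 and key2 are illustrative helper names, not part of the original posts.

```r
# Data from the question
df1 <- data.frame(
  ID     = c(1, 1, 2, 2, 3, 3, 3, 4, 5, 6),
  Year   = c(2004, 2005, 2006, 2007, 2008, 2010, 2010, 2012, 2013, 2014),
  Oxygen = c(18, 17, 17, 18, 19, 20, 20, 16, 18, 18),
  Week   = 1
)
df2 <- data.frame(
  ID   = c(1, 1, 2, 3, 3, 4, 5, 6, 6, 7),
  Year = c(2004, 2005, 2006, 2007, 2008, 2009, 2013, 2012, 2014, 2013),
  Kg   = c(20, 35, 30, 15, 70, 40, 55, 40, 10, 15),
  Week = c(1, 2, 2, 1, 2, 1, 1, 1, 2, 1)
)

# Build one key per row from ID and Year; Week is deliberately left out
key1 <- paste(df1$ID, df1$Year)
key2 <- paste(df2$ID, df2$Year)

# Keep the df1 rows whose ID/Year pair also occurs in df2
df3 <- df1[key1 %in% key2, ]
print(df3)
```

With the data exactly as posted, the pair ID 4 / Year 2012 does not occur in df2 (df2 has ID 4 with 2009 and ID 6 with 2012), so this filter drops that row.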

How to keep every ID once in a data frame, preferring the largest value first, then the newest in case of a tie, then random in case of a further tie

I have a large DF with about 250K obs. and 19 variables about people's graduation data. People are in the DF multiple times. I need to make two new DFs where every person (ID) appears only once per DF. My true DF has more columns, but for the example I kept the 3 that are needed and 2 as an example of the other background information in my DF.
ID - a person's unique ID
Level - Shows the level of education. The higher the number the better. Example: 8 = PhD, 7 = Masters
Year - The year of the graduation
Field - background information on the field of study. We need to keep this data.
Gender - Gender of the person. We need to keep this data.
Original DF named "Graduations"
ID Level Year Field Gender
1 4 2016 31 M
1 5 2016 43 M
2 6 2010 12 F
2 7 2012 12 F
2 8 2017 19 F
3 5 2011 12 F
3 5 2009 31 F
4 6 2018 43 M
4 6 2018 44 M
5 5 2015 19 M
5 6 2011 32 M
DF 1 is named "Highest", meaning the highest level of graduation. In this DF we only keep rows where the person has reached their highest level of education. If a person has reached it more than once, we keep the row with the newest graduation at that level, and if those are also tied we choose one randomly.
"Highest"
ID Level Year Field Gender
1 5 2016 43 M (Row kept because it was the highest level)
2 8 2017 19 F (Row kept because it was the highest level)
3 5 2011 12 F (Row kept because it was the highest level and newest)
4 6 2018 43 M (Row chosen randomly)
5 6 2011 32 M (Row Kept because it was the Highest level)
DF 2 is named "Last", meaning the last obtained level of graduation. In this DF we only keep the row with the newest data. If a person has had more than one graduation in that year, we choose the highest level, and if that is also a tie we choose at random (it does not need to be the same random row as in DF 1).
"Last"
ID Level Year Field Gender
1 5 2016 43 M (Row kept because it was the newest and highest level)
2 8 2017 19 F (Row kept because it was the newest)
3 5 2011 12 F (Row kept because it was the newest)
4 6 2018 44 M (Row chosen randomly)
5 5 2015 19 M (Row Kept because it was the newest)
I tried to search for it on Stack Overflow but did not find what I was looking for. I might have used the wrong keywords.
I prefer base R functions, but if needed the most common packages are also OK. This is because this DF is on a secure computer and installing uncommon packages takes forever, as I must make a request for it. The most common ones are already installed.
Thank you for your help.
A data.table solution
It's not quite elegant but it works.
library(data.table)
dt <- fread('ID Level Year Field Gender
1 4 2016 31 M
1 5 2016 43 M
2 6 2010 12 F
2 7 2012 12 F
2 8 2017 19 F
3 5 2011 12 F
3 5 2009 31 F
4 6 2018 43 M
4 6 2018 44 M
5 5 2015 19 M
5 6 2011 32 M')
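# sort by newest Year within ID first, so that ties on the highest Level keep the newest row
df1 <- dt[order(ID, -Year), head(.SD[Level == max(Level)], 1), by = .(ID)]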
df2 <- dt[,highest_lvl_year:=max(Level),.(ID,Year)][,
flag:=fifelse(Year==max(Year),fifelse(Level==highest_lvl_year,"Y","N"),"N"),by=.(ID)][,
head(.SD[flag=="Y"],1),by=.(ID)]
df1
#> ID Level Year Field Gender
#> 1: 1 5 2016 43 M
#> 2: 2 8 2017 19 F
#> 3: 3 5 2011 12 F
#> 4: 4 6 2018 43 M
#> 5: 5 6 2011 32 M
df2[,1:5]
#> ID Level Year Field Gender
#> 1: 1 5 2016 43 M
#> 2: 2 8 2017 19 F
#> 3: 3 5 2011 12 F
#> 4: 4 6 2018 43 M
#> 5: 5 5 2015 19 M
Created on 2020-04-20 by the reprex package (v0.3.0)
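Since the question prefers base R, here is a sketch of the same idea with order() and duplicated(). It assumes that shuffling the rows once before sorting is an acceptable way to break the remaining ties at random (order() leaves unresolved ties in their existing order). The helper keep_first_per_id and the seed are illustrative choices, not from the original posts.

```r
# Data from the question
Graduations <- data.frame(
  ID     = c(1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5),
  Level  = c(4, 5, 6, 7, 8, 5, 5, 6, 6, 5, 6),
  Year   = c(2016, 2016, 2010, 2012, 2017, 2011, 2009, 2018, 2018, 2015, 2011),
  Field  = c(31, 43, 12, 12, 19, 12, 31, 43, 44, 19, 32),
  Gender = c("M", "M", "F", "F", "F", "F", "F", "M", "M", "M", "M")
)

# Hypothetical helper: reorder the rows, then keep the first row of every ID
keep_first_per_id <- function(d, ord) {
  d <- d[ord, ]
  d[!duplicated(d$ID), ]
}

set.seed(1)  # assumption: only here to make the random tie-break reproducible
shuffled <- Graduations[sample(nrow(Graduations)), ]

# Highest: highest Level first, then newest Year; remaining ties stay in shuffled order
Highest <- keep_first_per_id(
  shuffled, order(shuffled$ID, -shuffled$Level, -shuffled$Year)
)

# Last: newest Year first, then highest Level; remaining ties stay in shuffled order
Last <- keep_first_per_id(
  shuffled, order(shuffled$ID, -shuffled$Year, -shuffled$Level)
)

print(Highest)
print(Last)
```

For ID 4 either Field 43 or Field 44 can come out, depending on the shuffle. Reshuffling before building Last would make the two random choices independent of each other.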

ggplot2 or sjPlot sum stacked barplot columns

I am running R version 3.5.2 (2018-12-20) -- "Eggshell Igloo" on a MacBook Pro, OS 10.14.2.
I have tried a few methods to get these plots. My preferred method was a stacked barplot of my data (a factor grouped over time on the x-axis and counts on the y-axis), with the counts of a dichotomous variable (0/1) stacked in each column so that they add up to the counts on the y-axis. However, I am flexible. I have this code that works; if I could overlay a barplot on it, that would help.
ggplot(dat, aes(x=factor(yr),y=n, group=(n>0)))+
stat_summary(aes(color=(n>0)),fun.y=length, geom="line")+
scale_color_discrete("Key",labels=c("NN", "N"))+
labs(title= "1992-2018", x="Years",y="n")
using my full dataset, I tried this and got really close to the stacked barplot, it gave me the correct counts per the "yr" variable, however for my variable "n" it gave me a continuous range 0-1.0.
p<-ggplot(data=dat, aes(x=dat$yr, y=n, fill=n)) +
geom_bar(stat="identity")
This is the data I am most interested in. I tried to then coerce it into a table then a data frame.
t2<- table(dat$yr, dat$n)
0 1
1992 6 0
1993 10 0
1994 3 1
1995 20 2
1996 15 2
1997 16 0
1998 16 0
1999 9 3
2000 5 0
2001 5 1
2002 7 1
2003 9 2
2004 4 3
2005 6 3
2006 5 3
2007 6 3
2008 4 3
2009 8 4
2010 7 1
2011 4 5
2012 4 5
2013 6 2
2014 0 2
2015 3 3
2016 5 5
2017 4 4
2018 8 5
t<-table(dat$yr)
1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
6 10 4 22 17 16 16 12 5 6 8 11 7
2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017
9 8 9 7 12 8 9 9 8 2 6 10 8
2018
13
I then tried:
df<- data.frame(t, t2)
head(df)
Var1 Freq Var1.1. Var2 Freq.1
1 1992 6 1992 0 6
2 1993 10 1993 0 10
3 1994 4 1994 0 3
4 1995 22 1995 0 20
5 1996 17 1996 0 15
6 1997 16 1997 0 16
p<-ggplot(data=df, aes(x=Var1, y=Var2)) +
geom_bar(stat="identity")
p
Replacing these with the dataset variables gave me worse results: the y-axis showed no counts per year for the "yr" variable, and each column was filled all the way to the top of the range, "1".
Again, I would like to get a stacked barplot with the binary "n" in each year column showing the 0/1 counts, which should add up to the "yr" counts on the y-axis. Alternatively, I could use the ggplot from the first code I posted and get the sums for each year there; I would take that as well.
This comes really, really close. If it also gave a total at the top it would be perfect.
package sjPlot:
sjp.grpfrq(dat$yr, dat$n, bar.pos = c("stack"), show.values = TRUE, show.n = TRUE, show.prc = FALSE, title = NULL)
The major issue with the sjPlot code is that I cannot change the legend labels. It shows n = 0, 1. I need to change this to something specific.
Thanks so much in advance!
Try this and see if that's what you want?
ggplot(data=df, aes(x=Var1, y=Freq)) +
geom_bar(stat="identity")
Resolved.
sjp.grpfrq(dat$yr, dat$n, bar.pos = c("stack"), legend.title = "Key", legend.labels = c("NN", "N"), show.values = TRUE, show.n = TRUE, show.prc = FALSE, show.axis.values = TRUE, title = "1992-2018")
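For the "total at the top" wish, a base graphics sketch (not sjPlot or ggplot2) that stacks the 0/1 counts per year and writes the column total above each bar. It uses four years copied from the table in the question; with the full data the matrix could presumably be built as t(table(dat$yr, dat$n)). The names counts, totals and mids are illustrative.

```r
# Four years taken from the table in the question: rows are n = 0 / n = 1
counts <- matrix(
  c(3, 20, 15, 9,   # n = 0
    1,  2,  2, 3),  # n = 1
  nrow = 2, byrow = TRUE,
  dimnames = list(n = c("0", "1"), yr = c("1994", "1995", "1996", "1999"))
)

# Total per year; this is the height of each stacked bar
totals <- colSums(counts)

# A matrix passed to barplot() is stacked by default; it returns the bar midpoints
mids <- barplot(
  counts,
  legend.text = c("NN", "N"),          # custom legend labels for n = 0 and n = 1
  args.legend = list(title = "Key"),
  ylim = c(0, max(totals) * 1.15),     # leave room for the labels on top
  xlab = "Years", ylab = "n"
)

# Write the total above each bar
text(x = mids, y = totals, labels = totals, pos = 3)
```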

How to merge 2 data sets using 2 common columns in each in R [duplicate]

This question already has answers here:
How to join (merge) data frames (inner, outer, left, right)
(13 answers)
Closed 5 years ago.
I'm trying to merge two data frames based on 2 columns in each. I want to merge the Territory column and IDMate column from Data Frame 2 to Data Frame 1 based on matching ID and Year columns.
Data Frame 1:
ID Year
1 1 1998
2 2 2001
3 3 2005
4 4 2008
Data Frame 2:
ID Year Territory IDMate
1 1 1998 A 22
2 1 1999 B 24
3 1 2000 C 25
4 2 2001 D 26
5 2 2002 E 27
6 3 2005 F 28
7 4 2008 G 29
Goal is to get this:
ID Year Territory IDMate
1 1 1998 A 22
2 2 2001 D 26
3 3 2005 F 28
4 4 2008 G 29
You can use left_join from dplyr:
library(dplyr)
res <- left_join(df1, df2, by = c("ID", "Year"))
# ID Year Territory IDMate
# 1 1998 A 22
# 2 2001 D 26
# 3 2005 F 28
# 4 2008 G 29
common <- intersect(data.frame1$col, data.frame2$col)
data.frame2[data.frame2$col %in% common, ]
Hopefully, this will get you what you want
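If dplyr is not available, the same result can be sketched with base R's merge(), matching on both ID and Year. The data frames are rebuilt from the tables in the question, and the names df1, df2 and res follow the dplyr answer.

```r
# Data from the question
df1 <- data.frame(ID = c(1, 2, 3, 4), Year = c(1998, 2001, 2005, 2008))
df2 <- data.frame(
  ID        = c(1, 1, 1, 2, 2, 3, 4),
  Year      = c(1998, 1999, 2000, 2001, 2002, 2005, 2008),
  Territory = c("A", "B", "C", "D", "E", "F", "G"),
  IDMate    = c(22, 24, 25, 26, 27, 28, 29)
)

# all.x = TRUE keeps every row of df1 (a left join);
# rows without a match in df2 would get NA in Territory and IDMate
res <- merge(df1, df2, by = c("ID", "Year"), all.x = TRUE)
print(res)
```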

How to delete observations in R based on the criterion that observations have the same value?

I have the following data frame, from which I would like to remove observations based on three criteria: x=x, y=y and z>=60.
df <- data.frame(x=c(1,1,2,2,3,3,4,4),
y=c(2011,2012,2011,2011,2013,2014,2011,2012),
z=c(15,15,60,60,15,15,30,15))
> df
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 2 2011 60
5 3 2013 15
6 3 2014 15
7 4 2011 30
8 4 2012 15
The data frame I'm looking for is thus (which one of the x=2 observations is removed doesn't matter):
> df1
x y z
1 1 2011 15
2 1 2012 15
3 2 2011 60
4 3 2013 15
5 3 2014 15
6 4 2011 30
7 4 2012 15
My first thoughts included using unique or duplicated, but I cannot seem to understand how to implement it in practice.
This should do the trick. Look for duplicated x and y entries where z is also greater than or equal to 60:
df[!(duplicated(df[,1:2]) & df$z >= 60), ]
# x y z
#1 1 2011 15
#2 1 2012 15
#3 2 2011 60
#5 3 2013 15
#6 3 2014 15
#7 4 2011 30
#8 4 2012 15
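To see why this works, the one-liner can be split into its two conditions. This is the same logic written out step by step; dup_xy and high_z are illustrative names.

```r
# Data from the question
df <- data.frame(x = c(1, 1, 2, 2, 3, 3, 4, 4),
                 y = c(2011, 2012, 2011, 2011, 2013, 2014, 2011, 2012),
                 z = c(15, 15, 60, 60, 15, 15, 30, 15))

# TRUE for every row whose x/y pair has already appeared in an earlier row
dup_xy <- duplicated(df[, c("x", "y")])

# TRUE for every row with z of at least 60
high_z <- df$z >= 60

# Drop only the rows that meet both conditions
df1 <- df[!(dup_xy & high_z), ]
print(df1)
```

Because duplicated() flags the second and later occurrences only, the first of the two x = 2 rows is kept.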
