How to create unique rows in a data frame - r

I have a dataframe where rows are duplicated. I need to create unique rows from this. I tried a couple of options but they don't seem to work
l1 <-summarise(group_by(l,bowler,wickets),economyRate,d=unique(date))
This works for some rows but also gives the error "Expecting a single value". The dataframe 'l' looks like this
bowler overs maidens runs wickets economyRate date opposition
(fctr) (int) (int) (dbl) (dbl) (dbl) (date) (chr)
1 MA Starc 9 0 51 0 5.67 2010-10-20 India
2 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
3 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
4 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
5 MA Starc 9 0 27 4 3.00 2010-11-07 Sri Lanka
6 MA Starc 6 0 33 2 5.50 2012-02-05 India
7 MA Starc 6 0 33 2 5.50 2012-02-05 India
8 MA Starc 10 0 50 2 5.00 2012-02-10 Sri Lanka
9 MA Starc 10 0 50 2 5.00 2012-02-10 Sri Lanka
10 MA Starc 8 0 49 0 6.12 2012-02-12 India
The date is unique and can be used to get the rows for which the row can be selected. Please let me know how this can be done.

In the example dataset, there are more than one unique elements of 'date' per each 'bowler', 'wickets' combination. One option would be to paste the unique 'date' together
l %>%
group_by(bowler, wickets) %>%
summarise(economyRate= mean(economyRate), d = toString(unique(date)))
Or create 'd' as a list column
l %>%
group_by(bowler, wickets) %>%
summarise(economyRate= mean(economyRate), d = list(unique(date)))
With respect to 'economyRate', I am guessing the OP need the mean of that.
If we need to create a column of unique date in the original dataset, use mutate
l %>%
group_by(bowler, wickets) %>%
mutate(d = list(unique(date)))
As the OP didn't provide the expected output, the below could be also the result
l %>%
group_by(bowler, wickets) %>%
distinct(date)
Or as #Frank mentioned
l %>%
group_by(bowler,wickets,date) %>%
slice(1L)

If I get the intention of th OP right, he is asking to simply remove the duplicate rows. So, I would use
unique(l1)
That's what ?unique says:
unique returns a vector, data frame or array like x but with duplicate elements/rows removed.

Data
l <- read.table(text = "bowler overs maidens runs wickets economyRate date opposition
1 MA_Starc 9 0 51 0 5.67 2010-10-20 India
2 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
3 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
4 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
5 MA_Starc 9 0 27 4 3.00 2010-11-07 Sri-Lanka
6 MA_Starc 6 0 33 2 5.50 2012-02-05 India
7 MA_Starc 6 0 33 2 5.50 2012-02-05 India
8 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
9 MA_Starc 10 0 50 2 5.00 2012-02-10 Sri-Lanka
10 MA_Starc 8 0 49 0 6.12 2012-02-12 India")
Distinct
Use dplyr::distinct to remove duplicated rows.
ldistinct <- distinct(l)
# bowler overs maidens runs wickets economyRate date
# 1 MA_Starc 9 0 51 0 5.67 2010-10-20
# 2 MA_Starc 9 0 27 4 3.00 2010-11-07
# 3 MA_Starc 6 0 33 2 5.50 2012-02-05
# 4 MA_Starc 10 0 50 2 5.00 2012-02-10
# 5 MA_Starc 8 0 49 0 6.12 2012-02-12
# opposition
# 1 India
# 2 Sri-Lanka
# 3 India
# 4 Sri-Lanka
# 5 India
l2 <- summarise(group_by(ldistinct,bowler,wickets),
economyRate,d=unique(date))
# Error: expecting a single value
But it's not enough here, there are still many dates for
one combination of bowler and wickets.
Collapse values together
By pasting multiple values together you will see that there are many dates and many economyRate for a single combination of bowler and wickets.
l3 <- summarise(group_by(l,bowler,wickets),
economyRate = paste(unique(economyRate),collapse=", "),
d=paste(unique(date),collapse=", "))
l3
# bowler wickets economyRate d
# (fctr) (int) (chr) (chr)
# 1 MA_Starc 0 5.67, 6.12 2010-10-20, 2012-02-12
# 2 MA_Starc 2 5.5, 5 2012-02-05, 2012-02-10
# 3 MA_Starc 4 3 2010-11-07

So, I took an unusual route to doing this disection, but I let the date remain a factor when it came over from the csv file I created. you could easily the date column to a factor with
l1$date<-as.factor(l1$date)
This will make that row a non-date row, you could also convert to character, either will work fine. This is what it looks like structurally.
str(l1)
'data.frame': 10 obs. of 10 variables:
$ bowler : Factor w/ 2 levels "(fctr)","MA": 2 2 2 2 2 2 2 2 2 2
$ overs : Factor w/ 2 levels "(int)","Starc": 2 2 2 2 2 2 2 2 2 2
$ maidens : Factor w/ 5 levels "(int)","10","6",..: 5 5 5 5 5 3 3 2 2 4
$ runs : Factor w/ 2 levels "(dbl)","0": 2 2 2 2 2 2 2 2 2 2
$ wickets : Factor w/ 6 levels "(dbl)","27","33",..: 6 2 2 2 2 3 3 5 5 4
$ economyRate: Factor w/ 4 levels "(dbl)","0","2",..: 2 4 4 4 4 3 3 3 3 2
$ date : Factor w/ 6 levels "(date)","3","5",..: 5 2 2 2 2 4 4 3 3 6
$ opposition : Factor w/ 6 levels "(chr)","10/20/2010",..: 2 3 3 3 3 6 6 4 4 5
$ X.1 : Factor w/ 3 levels "","India","Sri": 2 3 3 3 3 2 2 3 3 2
$ X.2 : Factor w/ 2 levels "","Lanka": 1 2 2 2 2 1 1 2 2 1
After that it is about making sure that you are using the sub-setting grammar properly with the most concise query:
l2<-l1[!duplicated(l1$date),]
And this is what is returned, 5 rows of unique data:
bowler overs maidens runs wickets economyRate date opposition X.1 X.2
2 MA Starc 9 0 51 0 5.67 10/20/2010 India
3 MA Starc 9 0 27 4 3 11/7/2010 Sri Lanka
7 MA Starc 6 0 33 2 5.5 2/5/2012 India
9 MA Starc 10 0 50 2 5 2/10/2012 Sri Lanka
11 MA Starc 8 0 49 0 6.12 2/12/2012 India
The only thing you need to be careful of is to keep that comma after the !duplicated(l1$date) to be sure that ALL columns are searched and included in the final subset.
If you want dates or characters you can as.POSIXct or as.character convert them to a usable format for the rest of your manipulation.
I hope this is useful to you!

Related

Get the average of the values of one column for the values in another

I was not so sure how to ask this question. i am trying to answer what is the average tone when an initiative is mentioned and additionally when a topic, and a goal( or achievement) are mentioned. My dataframe (df) has many mentions of 70 initiatives (rows). meaning my df has 500+ rows of data, but only 70 Initiatives.
My data looks like this
> tabmean
Initiative Topic Goals Achievements Tone
1 52 44 2 2 2
2 294 42 2 2 2
3 103 31 2 2 2
4 52 41 2 2 2
5 87 26 2 1 1
6 52 87 2 2 2
7 136 81 2 2 2
8 19 7 2 2 1
9 19 4 2 2 2
10 0 63 2 2 2
11 0 25 2 2 2
12 19 51 2 2 2
13 52 51 2 2 2
14 108 94 2 2 1
15 52 89 2 2 2
16 110 37 2 2 2
17 247 25 2 2 2
18 66 95 2 2 2
19 24 49 2 2 2
20 24 110 2 2 2
I want to find what is the mean or average Tone when an Initiative is mentioned. as well as what is the Tone when an Initiative, a Topic and a Goal are mentioned at the same time. The code options for Tone are : positive(coded: 1), neutral(2), negative (coded:3), and both positive and negative(4). Goals and Achievements are coded yes(1) and no(2).
I have used this code:
GoalMeanTone <- tabmean %>%
group_by(Initiative,Topic,Goals,Tone) %>%
summarize(averagetone = mean(Tone))
With Solution output :
GoalMeanTone
# A tibble: 454 x 5
# Groups: Initiative, Topic, Goals [424]
Initiative Topic Goals Tone averagetone
<chr> <chr> <chr> <chr> <dbl>
1 0 104 2 0 NA
2 0 105 2 0 NA
3 0 22 2 0 NA
4 0 25 2 0 NA
5 0 29 2 0 NA
6 0 30 2 1 NA
7 0 31 1 1 NA
8 0 42 1 0 NA
9 0 44 2 0 NA
10 0 44 NA 0 NA
# ... with 444 more rows
note that for Initiative Value 0 means "other initiative".
and I've also tried this code
library(plyr)
GoalMeanTone2 <- ddply( tabmean, .(Initiative), function(x) mean(tabmean$Tone) )
with solution output
> GoalMeanTone2
Initiative V1
1 0 NA
2 1 NA
3 101 NA
4 102 NA
5 103 NA
6 104 NA
7 105 NA
8 107 NA
9 108 NA
10 110 NA
Note that in both instances, I do not get an average for Tone but instead get NA's
I have removed the NAs in the df from the column "Tone" also have tried to remove all the other mission values in the df ( its only about 30 values that i deleted).
and I have also re-coded the values for Tone :
tabmean<-Meantable %>% mutate(Tone=recode(Tone,
`1`="1",
`2`="0",
`3`="-1",
`4`="2"))
I still cannot manage to get the average tone for an initiative. Maybe the solution is more obvious than i think, but have gotten stuck and have no idea how to proceed or solve this.
i'd be super grateful for a better code to get this. Thanks!
I'm not completely sure what you mean by 'the average tone when an initiative is mentioned', but let's say that you'd want to get the average tone for when initiative=1, you could try the following:
tabmean %>% filter(initiative==1) %>% summarise(avg_tone=mean(tone, na.rm=TRUE)
Note that (1) you have to add na.rm==TRUE to the summarise call if you have missing values in the column that you are summarizing, otherwise it will only produce NA's, and (2) check that the columns are of type numeric (you could check that with str(tabmean) and for example change tone to numeric with tabmean <- tabmean %>% mutate(tone=as.numeric(tone)).

problem in changing matrix to a data frame with same dimensions

I have tried to create a data frame from a matrix; however, the result has a different dimension comparing to the main matrix. Please see below my code:
out <- table(UL_Final$Issue_Year, UL_Final$Insured_Age_Group)
out <- out/rowSums(out) #changing all numbers to ratio
The result is a matrix 12 by 7:
1 2 3 4 5 6 7
1387 0.165137615 0.036697248 0.229357798 0.321100917 0.201834862 0.018348624 0.027522936
1388 0.149222065 0.110325318 0.197312588 0.342291372 0.136492221 0.055162659 0.009193777
1389 0.144979508 0.101946721 0.222848361 0.335553279 0.138575820 0.046362705 0.009733607
1390 0.146991622 0.120030465 0.191622239 0.336024372 0.142269612 0.052551409 0.010510282
1391 0.165462754 0.111794582 0.185835214 0.321049661 0.135553047 0.064503386 0.015801354
1392 0.162399144 0.109583402 0.165321917 0.317388441 0.146344476 0.076115594 0.022847028
1393 0.181602139 0.116447173 0.151104070 0.325131201 0.148628577 0.062778493 0.014308347
1394 0.163760504 0.098529412 0.142489496 0.323792017 0.178728992 0.076050420 0.016649160
1395 0.137097032 0.094699511 0.128981757 0.321320170 0.197610147 0.098245950 0.022045433
1396 0.167187958 0.103851041 0.112696706 0.293202033 0.200689082 0.099306031 0.023067149
1397 0.193250090 0.130540713 0.108114843 0.270743930 0.186411584 0.091364656 0.019574185
1398 0.208026156 0.147573562 0.100455157 0.249503173 0.191935380 0.083338676 0.019167895
then using the code below:
out <- data.frame(out)
However, the result will change to a data frame and dimension of 84 by 3
Var1 Var2 Freq
1 1387 1 0.165137615
2 1388 1 0.149222065
3 1389 1 0.144979508
4 1390 1 0.146991622
5 .... .......
I am not sure why this happens. However in another case, as I explained below, I am not seeing such strange behavior. In another case, I used the code below to calculate another ratio for another variable:
out <- table( df_select$Insured_Age_Group,df_select$Policy_Status)
out <- cbind(out, Ratio = out[,2]/rowSums(out))
the result is :
Issuance Surrended Ratio
1 31046 5735 0.1559229
2 20039 4409 0.1803420
3 20399 9228 0.3114726
4 48677 17216 0.2612721
5 30045 8132 0.2130078
6 13947 4106 0.2274414
7 3157 1047 0.2490485
Now if we used the code below (by #Ronak Shah):
out <- data.frame(out) %>% mutate(x = row_number())
the result is :
Issuance Surrended Ratio x
1 31046 5735 0.1559229 1
2 20039 4409 0.1803420 2
3 20399 9228 0.3114726 3
4 48677 17216 0.2612721 4
5 30045 8132 0.2130078 5
6 13947 4106 0.2274414 6
7 3157 1047 0.2490485 7
As you can see the result is now a data frame with same dimension. Can anyone explain why this happens?
See ?table for an explanation:
The as.data.frame method for objects inheriting from class "table" can be used to convert the array-based representation of a contingency table to a data frame containing the classifying factors and the corresponding entries (the latter as component named by responseName). This is the inverse of xtabs.
A workaround is to use as.data.frame.matrix:
m <- table(mtcars$carb, mtcars$gear)
as.data.frame(m)
# Var1 Var2 Freq
# 1 1 3 3
# 2 2 3 4
# 3 3 3 3
# 4 4 3 5
# 5 6 3 0
# 6 8 3 0
# 7 1 4 4
# 8 2 4 4
# 9 3 4 0
# 10 4 4 4
# 11 6 4 0
# 12 8 4 0
# 13 1 5 0
# 14 2 5 2
# 15 3 5 0
# 16 4 5 1
# 17 6 5 1
# 18 8 5 1
as.data.frame.matrix(m)
# 3 4 5
# 1 3 4 0
# 2 4 4 2
# 3 3 0 0
# 4 5 4 1
# 6 0 0 1
# 8 0 0 1

Counting the number of changes of a categorical variable during repeated measurements within a category

I'm working with a dataset about migration across the country with the following columns:
i birth gender race region urban wage year educ
1 58 2 3 1 1 4620 1979 12
1 58 2 3 1 1 4620 1980 12
1 58 2 3 2 1 4620 1981 12
1 58 2 3 2 1 4700 1982 12
.....
i birth gender race region urban wage year educ
45 65 2 3 3 1 NA 1979 10
45 65 2 3 3 1 NA 1980 10
45 65 2 3 4 2 11500 1981 10
45 65 2 3 1 1 11500 1982 10
i = individual id. They follow a large group of people for 25 years and record changes in 'region' (categorical variables, 1-4) , 'urban' (dummy), 'wage' and 'educ'.
How do I count the aggregate number of times 'region' or 'urban' has changed (eg: from region 1 to region 3 or from urban 0 to 1) during the observation period (25 year period) within each subject? I also have some NA's in the data (which should be ignored)
A simplified version of expected output:
i changes in region
1 1
...
45 2
i changes in urban
1 0
...
45 2
I would then like to sum up the number of changes for region and urban.
I came across these answers: Count number of changes in categorical variables during repeated measurements and Identify change in categorical data across datapoints in R but I still don't get it.
Here's a part of the data for i=4.
i birth gender race region urban wage year educ
4 62 2 3 1 1 NA 1979 9
4 62 2 3 NA NA NA 1980 9
4 62 2 3 4 1 0 1981 9
4 62 2 3 4 1 1086 1982 9
4 62 2 3 1 1 70 1983 9
4 62 2 3 1 1 0 1984 9
4 62 2 3 1 1 0 1985 9
4 62 2 3 1 1 7000 1986 9
4 62 2 3 1 1 17500 1987 9
4 62 2 3 1 1 21320 1988 9
4 62 2 3 1 1 21760 1989 9
4 62 2 3 1 1 0 1990 9
4 62 2 3 1 1 0 1991 9
4 62 2 3 1 1 30500 1992 9
4 62 2 3 1 1 33000 1993 9
4 62 2 3 NA NA NA 1994 9
4 62 2 3 4 1 35000 1996 9
Here, output should be:
i change_reg change_urban
4 3 0
Here is something I hope will get your closer to what you need.
First you group by i. Then, you can then create a column that will indicate a 1 for each change in region. This compares the current value for the region with the previous value (using lag). Note if the previous value is NA (when looking at the first value for a given i), it will be considered no change.
Same approach is taken for urban. Then, summarize totaling up all the changes for each i. I left in these temporary variables so you can examine if you are getting the results desired.
Edit: If you wish to remove rows that have NA for region or urban you can add drop_na first.
library(dplyr)
library(tidyr)
df_tot <- df %>%
drop_na(region, urban) %>%
group_by(i) %>%
mutate(reg_change = ifelse(region == lag(region) | is.na(lag(region)), 0, 1),
urban_change = ifelse(urban == lag(urban) | is.na(lag(urban)), 0, 1)) %>%
summarize(tot_region = sum(reg_change),
tot_urban = sum(urban_change))
# A tibble: 3 x 3
i tot_region tot_urban
<int> <dbl> <dbl>
1 1 1 0
2 4 3 0
3 45 2 2
Edit: Afterwards, to get a grand total for both tot_region and tot_urban columns, you can use colSums. (Store your earlier result as df_tot as above.)
colSums(df_tot[-1])
tot_region tot_urban
6 2

r: append mean of a subset of columns by name

I have this df:
webvisits1 webvisits2 webvisits3 webvisits4
s001 2 0 11 2
s002 11 2 23 3
s003 12 1 1 5
s004 13 5 5 0
s005 4 3 9 3
I need to create an output dataframe with an added columns containing the difference between the mean of webvisits(3-4) and webvisits (1-2), like so:
webvisits1 webvisits2 webvisits3 webvisits4 difference_mean
s001 2 0 11 2 -5.5
s002 11 2 23 3 -6.5
s003 12 1 1 5 3.5
s004 13 5 5 0 6.5
s005 4 3 9 3 -2.5
Is there an easy way to do so, considering that column names (webvisits) are important?
Thank you
rowSums function can sum rows of each variables, then after find difference between existing variables and take mean of them
library(dplyr)
dt %>%
mutate(difference_mean = (rowSums(dt[,2:3])-rowSums(dt[,4:5]))/2)
s.no webvisits1 webvisits2 webvisits3 webvisits4 difference_mean
1 s001 2 0 11 2 -5.5
2 s002 11 2 23 3 -6.5
3 s003 12 1 1 5 3.5
4 s004 13 5 5 0 6.5
5 s005 4 3 9 3 -2.5
We subset the dataset into two (df[1:2], df[3:4]), get the difference and then with rowMeans we find the mean, create a new column 'differenceMean' using transform.
df <- transform(df, differenceMean = rowMeans(df[1:2]- df[3:4]))
df
# webvisits1 webvisits2 webvisits3 webvisits4 differenceMean
#s001 2 0 11 2 -5.5
#s002 11 2 23 3 -6.5
#s003 12 1 1 5 3.5
#s004 13 5 5 0 6.5
#s005 4 3 9 3 -2.5

Using R to generate a time series of averages from a very large dataset without using for loops

I am working with a large dataset of patent data. Each row is an individual patent, and columns contain information including application year and number of citations in the patent.
> head(p)
allcites appyear asscode assgnum cat cat_ocl cclass country ddate gday gmonth
1 6 1974 2 1 6 6 2/161.4 US 6 1
2 0 1974 2 1 6 6 5/11 US 6 1
3 20 1975 2 1 6 6 5/430 US 6 1
4 4 1974 1 NA 5 <NA> 114/354 6 1
5 1 1975 1 NA 6 6 12/142S 6 1
6 3 1972 2 1 6 6 15/53.4 US 6 1
gyear hjtwt icl icl_class icl_maingroup iclnum nclaims nclass nclass_ocl
1 1976 1 A41D 1900 A41D 19 1 4 2 2
2 1976 1 A47D 701 A47D 7 1 3 5 5
3 1976 1 A47D 702 A47D 7 1 24 5 5
4 1976 1 B63B 708 B63B 7 1 7 114 9
5 1976 1 A43D 900 A43D 9 1 9 12 12
6 1976 1 B60S 304 B60S 3 1 12 15 15
patent pdpass state status subcat subcat_ocl subclass subclass1 subclass1_ocl
1 3930271 10030271 IL 63 63 161.4 161.4 161
2 3930272 10156902 PA 65 65 11.0 11 11
3 3930273 10112031 MO 65 65 430.0 430 331
4 3930274 NA CA 55 NA 354.0 354 2
5 3930275 NA NJ 63 63 NA 142S 142
6 3930276 10030276 IL 69 69 53.4 53.4 53
subclass_ocl term_extension uspto_assignee gdate
1 161 0 251415 1976-01-06
2 11 0 246000 1976-01-06
3 331 0 10490 1976-01-06
4 2 0 0 1976-01-06
5 142 0 0 1976-01-06
6 53 0 243840 1976-01-06
I am attempting to create a new data frame which contains the mean number of citations (allcites) per application year (appyear), separated by category (cat), for patents from 1970 to 2006 (the data goes all the way back to 1901). I did this successfully, but I feel like my solution is somewhat ad hoc and does not take advantage of the specific capabilities of R. Here is my solution
#citations by category
citescat <- data.frame("chem"=integer(37),
"comp"=integer(37),
"drugs"=integer(37),
"ee"=integer(37),
"mech"=integer(37),
"other"=integer(37),
"year"=1970:2006
)
for (i in 1:37) {
for (j in 1:6) {
citescat[i,j] <- mean(p$allcites[p$appyear==(i+1969) & p$cat==j], na.rm=TRUE)
}
}
I am wondering if there is a simple way to do this without using the nested for loops which would make it easy to make small tweaks to it. It is hard for me to pin down exactly what I am looking for other than this, but my code just looks ugly to me and I suspect that there are better ways to do this in R.
Joran is right - here's a plyr solution. Without your dataset in a usable form it's hard to show you exactly, but here it is in a simplified dataset:
p <- data.frame(allcites = sample(1:20, 20), appyear = 1974:1975, pcat = rep(1:4, each = 5))
#First calculate the means of each group
cites <- ddply(p, .(appyear, pcat), summarise, meancites = mean(allcites, na.rm = T))
#This gives us the data in long form
# appyear pcat meancites
# 1 1974 1 14.666667
# 2 1974 2 9.500000
# 3 1974 3 10.000000
# 4 1974 4 10.500000
# 5 1975 1 16.000000
# 6 1975 2 4.000000
# 7 1975 3 12.000000
# 8 1975 4 9.333333
#Now use dcast to get it in wide form (which I think your for loop was doing):
citescat <- dcast(cites, appyear ~ pcat)
# appyear 1 2 3 4
# 1 1974 14.66667 9.5 10 10.500000
# 2 1975 16.00000 4.0 12 9.333333
Hopefully you can see how to adapt that to your specific data.

Resources