Partial De-duplication in R based on string value match

I have a dataframe named 'reviews' like this:
score_phrase title score release_year release_month release_day
1 Amazing LittleBigPlanet PS Vita 9 2012 9 12
2 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition 9 2012 9 12
3 Great Splice: Tree of Life 8.5 2012 9 12
4 Great NHL 13 8.5 2012 9 11
5 Great NHL 13 8.5 2012 9 11
6 Good Total War Battles: Shogun 7 2012 9 11
7 Awful Double Dragon: Neon 3 2012 9 11
8 Amazing Guild Wars 2 9 2012 9 11
9 Awful Double Dragon: Neon 3 2012 9 11
10 Good Total War Battles: Shogun 7 2012 9 11
Objective: A slight mismatch/typo in a column value causes duplicated records. Here Row 1 and Row 2 are duplicates, and Row 2 should be dropped after de-duplication.
I used the dedup() function of the 'scrubr' package to perform the de-duplication, but on a large dataset I get an inconsistent number of duplicates when I adjust the tolerance level for string matching.
For example:
partial_dup_data <- reviews[1:100, ] %>% dedup(tolerance = 0.7)
# count without duplicates: 90
attr(partial_dup_data, "dups")
# count of identified duplicates: 16
Could somebody suggest what I am doing incorrectly? Is there another approach to achieve the objective?
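If scrubr's tolerance behaves unpredictably on your data, one alternative is to do the fuzzy matching yourself with base R's adist(). The sketch below is an illustration under stated assumptions, not a drop-in replacement for dedup(): it assumes the near-duplicates live in the title column, normalises the Levenshtein distance by the longer string, and keeps only the first row of each fuzzy group. The tol threshold is a value you would need to tune (roughly 0.6 catches the LittleBigPlanet pair above while leaving distinct titles alone).
# Sketch: drop rows whose 'title' is close to an earlier kept row's title.
# tol is the normalised edit distance below which two titles count as duplicates.
dedup_fuzzy <- function(df, col = "title", tol = 0.6) {
  s <- tolower(df[[col]])
  d <- adist(s) / outer(nchar(s), nchar(s), pmax)  # normalised Levenshtein distance
  keep <- rep(TRUE, length(s))
  for (i in seq_along(s)[-1]) {
    prev <- seq_len(i - 1)
    if (any(d[i, prev][keep[prev]] < tol)) keep[i] <- FALSE
  }
  df[keep, , drop = FALSE]
}
deduped <- dedup_fuzzy(reviews[1:100, ])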

Related

Importing a .csv file into tidy data

I am having difficulty importing my data the way I would like from a .csv file into tidy data.
My data set is made up of descriptive data (age, country, etc.) followed by 15 condition columns that I would like to collapse into a single column (long format). I have tried 'melting' the data in a few ways, but it does not turn out as intended; the attempts below are admittedly messy. There are quite a few NAs in the data, which seem to be causing an issue. I am trying to create a column called "Vignette" to serve as the collective column for the 15 vignette columns I would like in long format.
head(dat)
ID Frequency Gender Country Continent Age
1 5129615189 At least weekly female France Europe 30-50 years
2 5128877943 At least daily female Spain Europe > 50 years
3 5126775994 At least weekly female Spain Europe 30-50 years
4 5126598863 At least daily male Albania Europe 30-50 years
5 5124909744 At least daily female Ireland Europe > 50 years
6 5122047758 At least weekly female Denmark Europe 30-50 years
Practice Specialty Seniority AMS
1 University public hospital centre Infectious diseases 6-10 years Yes
2 Other public hospital Infectious diseases > 10 years Yes
3 University public hospital centre Intensive care > 10 years Yes
4 University public hospital centre Infectious diseases > 10 years No
5 Private hospial/clinic Clinical microbiology > 10 years Yes
6 University public hospital centre Infectious diseases 0-5 years Yes
Durations V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
1 range 7 2 7 7 7 5 7 14 7 42 42 90 7 NA 5
2 range 7 10 10 5 14 5 7 14 10 42 21 42 14 14 14
3 range 7 5 5 7 14 5 5 13 10 42 42 42 5 0 7
4 range 10 7 7 5 7 10 7 5 7 28 14 42 10 10 7
5 range 7 5 7 7 14 7 7 14 10 42 42 90 10 0 7
6 fixed duration 7 3 3 7 10 10 7 14 7 90 90 90 10 7 7
dat_long %>%
  gather(Days, Age, -Vignette)

dat$new_sp = NULL
names(dat) <- gsub("new_sp", "", names(dat))

dat_tidy <- melt(
  data = dat,
  id = 0:180,
  variable.name = "Vignette",
  value.name = "Days",
  na.rm = TRUE
)

dat_tidy <- mutate(dat_tidy,
  Days = sub("^V", "", Days)
)
It keeps saying "Error: id variables not found in data: NA"
I have tried to get rid of NAs but it doesn't seem to do anything.
I am guessing you are loading the melt function from reshape2. I would recommend tidyr instead, which is essentially the next generation of reshape2.
Your error presumably comes from the argument id = 0:180, which asks melt to keep columns 0 through 180 as "identifier" columns and to melt the rest (i.e. create a new row for each value in each remaining column). When you request more column indices than the data.frame has, the non-existing columns come back as pure NA - you asked for them, so you get them - which is why melt complains "id variables not found in data: NA".
tidyr has newer, more intuitive verbs for this, but here is a solution with the older gather() semantics:
library(tidyr)
dat_tidy <- dat %>% gather('Vignette', 'Days', starts_with('V'))
# or a bit more verbose
dat_tidy <- dat %>% gather('Vignette', 'Days', V01, V02, V03, V04)
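For completeness, here is a sketch with the newer tidyr verb pivot_longer() (tidyr >= 1.0.0), which also folds in the NA removal and the "^V" prefix stripping from the original attempt (assuming no other columns start with a capital V):
library(tidyr)
dat_tidy <- dat %>%
  pivot_longer(
    cols = starts_with("V"),   # the 15 vignette columns V01-V15
    names_to = "Vignette",
    names_prefix = "V",        # strips the leading "V", like sub("^V", "", ...)
    values_to = "Days",
    values_drop_na = TRUE      # drops the NAs that tripped up melt()
  )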

How can I add a new variable with mutate(): growth rate?

I haven't coded for several months and now am stuck with the following issue.
I have the following dataset:
Year World_export China_exp World_import China_imp
1 1992 3445.534 27.7310 3402.505 6.2220
2 1993 1940.061 27.8800 2474.038 18.3560
3 1994 2458.337 39.6970 2978.314 3.3270
4 1995 4641.168 15.9790 5504.787 18.0130
5 1996 5680.688 74.1650 6939.291 25.1870
6 1997 7206.604 70.2440 8639.422 31.9030
7 1998 7069.725 99.6510 8530.293 41.5030
8 1999 5916.077 169.4593 6673.743 37.8139
9 2000 7331.588 136.2180 8646.253 47.3789
10 2001 7471.374 143.0542 8292.893 41.2899
11 2002 8074.975 217.4286 9092.341 46.4730
12 2003 9956.433 162.2522 11558.007 71.7753
13 2004 13751.671 282.8678 16345.452 157.0768
14 2005 15976.238 430.8655 16708.094 284.1065
15 2006 19728.935 398.6704 22344.856 553.6356
16 2007 24275.244 484.5276 28693.113 815.7914
17 2008 32570.781 613.3714 39381.251 1414.8120
18 2009 21282.228 173.9463 28563.576 1081.3720
19 2010 25283.462 475.7635 34884.450 1684.0839
20 2011 41418.670 636.5881 45759.051 2193.8573
21 2012 46027.529 432.6025 46404.382 2373.4535
22 2013 37132.301 460.7133 43022.550 2829.3705
23 2014 36046.461 640.2552 40502.268 2373.2351
24 2015 26618.982 781.0016 30264.299 2401.1907
25 2016 23537.354 472.7022 27609.884 2129.4806
What I need is simple: to compute the growth rate of each variable, that is, take the difference between consecutive elements, divide it by the earlier element, and multiply by 100.
I'm trying the following script, which ends with an error message:
trade_Ch %>%
  mutate(
    World_exp_grate = sapply(2:nrow(trade_Ch),
                             function(i) (World_export[i] - World_export[i-1]) / World_export[i-1])
  )
Error in mutate_impl(.data, dots) : Column World_exp_grate must
be length 25 (the number of rows) or one, not 24
although this piece of code gives me the right values:
x <- sapply(2:nrow(trade_Ch),function(i)((trade_Ch$World_export[i]-trade_Ch$World_export[i-1])/trade_Ch$World_export[i-1]))
How can I correctly embedd the code into my MUTATE part from dplyr package?
OR
Is there is another elegant way to solve this issue?
library(dplyr)
df %>%
mutate_each(funs(chg = ((.-lag(.))/lag(.))*100), World_export:China_imp)
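Note that mutate_each() and funs() are deprecated in current dplyr; a sketch of the same idea with across() (dplyr >= 1.0.0) would be:
trade_Ch %>%
  mutate(across(World_export:China_imp,
                ~ (.x - lag(.x)) / lag(.x) * 100,   # percent change vs. previous row
                .names = "{.col}_grate"))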
trade_Ch %>%
  mutate(world_exp_grate = 100 * (World_export - lag(World_export)) / lag(World_export))
The problem is that you cannot calculate World_exp_grate for the first row: sapply() over rows 2:25 returns only 24 values, while mutate() needs a column of length 25 (the number of rows) or one. The first element therefore has to be NA.
One variant to solve this is
trade_Ch %>%
  mutate(World_export_lag = lag(World_export),
         World_exp_grate = (World_export - World_export_lag) / World_export_lag) %>%
  select(-World_export_lag)
lag shifts the vector by one position.
lag(1:5)
# [1] NA 1 2 3 4

Creating a dynamic vector with loop in R [duplicate]

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
Closed 6 years ago.
I need to create a third column in the dataframe below (called teste) containing the mean price for that row's vehicle model, such that a car row gets the mean over all car rows, and likewise for bikes and trucks.
model price
car 10
car 11
car 12
car 13
car 14
bike 5
bike 6
bike 7
bike 8
bike 9
truck 12
truck 13
truck 14
truck 15
truck 16
I was able to create a for loop which can print the desired results with the following R code:
for (x in teste$model) {
  print(mean(teste[teste$model == x, ]$price))
}
However, when I try to fill the third column with the code below, I get an error stating that the replacement has more rows than the data.
teste$media <- rep(NA, 15)
for (x in teste$model) {
  teste$media[x] <- mean(teste[teste$model == x, ]$price)
}
I have no idea why the replacement vector is bigger. Can anyone help me identify the error or propose another way to accomplish the goal?
Thank you all in advance
Alex
The loop fails because x is the model name ("car", "bike", "truck"), not a row index: teste$media[x] assigns to an element named "car", which appends a new element to the vector, so the replacement ends up longer than the data frame.
For the calculation itself, use ave(), which applies mean as its default function. See ?ave.
> teste$media <- ave(teste$price, teste$model)
> teste
model price media
1 car 10 12
2 car 11 12
3 car 12 12
4 car 13 12
5 car 14 12
6 bike 5 7
7 bike 6 7
8 bike 7 7
9 bike 8 7
10 bike 9 7
11 truck 12 14
12 truck 13 14
13 truck 14 14
14 truck 15 14
15 truck 16 14
With dplyr:
library(dplyr)
teste %>%
  group_by(model) %>%
  mutate(media = mean(price))
Or with data.table:
library(data.table)
setDT(teste)[ , media:=mean(price), by=model]

How to calculate the exponential of some columns of a dataframe in R?

I have a dataframe:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 2009 12.42669703 12.41831191
2 2010 12.39309563 12.40043599
3 2011 12.36596964 12.38256006
4 2012 12.32067284 12.36468414
5 2013 12.303095 12.34680822
6 2014 NA 12.32893229
7 2015 NA 12.31105637
8 2016 NA 12.29318044
9 2017 NA 12.27530452
10 2018 NA 12.25742859
I want to calculate the exponential of the third and fourth columns. How can I do that?
In case your dataframe is called dfs, you can do the following:
dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')] <- exp(dfs[c('Dependent.variable.1','Forecast.Dependent.variable.1')])
which gives you:
X Year Dependent.variable.1 Forecast.Dependent.variable.1
1 1 2009 249371 247288.7
2 2 2010 241131 242907.5
3 3 2011 234678 238603.9
4 4 2012 224285 234376.5
5 5 2013 220377 230224.0
6 6 2014 NA 226145.1
7 7 2015 NA 222138.5
8 8 2016 NA 218202.9
9 9 2017 NA 214336.9
10 10 2018 NA 210539.5
In case you know the column numbers, this could then also simply be done by using:
dfs[,3:4] <- exp(dfs[,3:4])
which gives you the same result as above. I usually prefer to use the actual column names, as the indices might change when the data frame is further processed (e.g. if I delete a column, the indices shift).
Or you could do:
dfs$Dependent.variable.1 <- exp(dfs$Dependent.variable.1)
dfs$Forecast.Dependent.variable.1 <- exp(dfs$Forecast.Dependent.variable.1)
In case you want to store these columns in new variables (below they are called exp1 and exp2, respectively), you can do:
exp1 <- exp(dfs$Forecast.Dependent.variable.1)
exp2 <- exp(dfs$Dependent.variable.1)
In case you want to apply it to more than two columns and/or use more complicated functions, I highly recommend looking at apply/lapply, as sketched below.
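For example, a minimal lapply() sketch, assuming you want exp() on every numeric column except the identifiers X and Year:
# select the numeric columns, excluding the identifier columns
num_cols <- setdiff(names(dfs)[sapply(dfs, is.numeric)], c("X", "Year"))
dfs[num_cols] <- lapply(dfs[num_cols], exp)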
Does that answer your question?

Obtain means within site and year in R

I have a data set with multiple sites that were each sampled over multiple years. As part of this I have climate data that were sampled throughout each year as well as calculated means for several variables (mean annual temp, mean annual precipitation, mean annual snow depth, etc). Here is what the data frame actually looks like:
site date year temp precip mean.ann.temp mn.ann.precip
a 5/1/10 2010 15 0 6 .03
a 6/2/10 2010 18 1 6 .03
a 7/3/10 2010 22 0 6 .03
b 5/2/10 2010 16 2 7 .04
b 6/3/10 2010 17 3 7 .04
b 7/4/10 2010 20 0 7 .04
c 5/3/10 2010 14 0 5 .06
c 6/4/10 2010 13 0 5 .06
c 7/8/10 2010 25 0 5 .06
d 5/5/10 2010 16 15 10 .2
d 6/6/10 2010 22 0 10 .2
d 7/7/10 2010 24 0 10 .2
...
It then goes on the same way for multiple years.
How can I extract the mean.ann.temp and mn.ann.precip for each site and year? I've tried nesting tapply() calls and double for loops without success; I can't seem to figure it out. Can someone help me? Or do I have to do it the long and tedious way of subsetting everything out?
Thanks,
Paul
Subset the relevant columns and wrap the result in unique():
unique(d[,c("site","year","mean.ann.temp","mn.ann.precip")])
A similar approach, if the last two columns vary within a site/year combination and you want the first row of each group:
d[!duplicated(d[,c("site","year")]),]
To compute the summaries with plyr:
require(plyr)
ddply(yourDF, .(site, year), summarize,
      meanTemp = mean(mean.ann.temp),
      meanPrec = mean(mn.ann.precip)
)
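The same summary with dplyr, which has largely superseded plyr (assuming your data frame is called d, as in the unique() example above):
library(dplyr)
d %>%
  group_by(site, year) %>%
  summarise(meanTemp = mean(mean.ann.temp),
            meanPrec = mean(mn.ann.precip),
            .groups = "drop")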
