Creating a dynamic vector with a loop in R [duplicate]

This question already has answers here:
Adding a column of means by group to original data [duplicate]
(4 answers)
Closed 6 years ago.
I need to create a third column in the data frame below (called teste) containing the mean price for that row's vehicle model: on a car row it would hold the mean across all car rows, and similarly for bikes and trucks.
model price
car 10
car 11
car 12
car 13
car 14
bike 5
bike 6
bike 7
bike 8
bike 9
truck 12
truck 13
truck 14
truck 15
truck 16
I was able to create a for loop which can print the desired results with the following R code:
for (x in teste$model) {
  print(mean(teste[teste$model == x, ]$price))
}
However, when I try to fill the third column, the code below gives me an error stating that the replacement has more rows than the data.
teste$media <- rep(NA, 15)
for (x in teste$model) {
  teste$media[x] <- mean(teste[teste$model == x, ]$price)
}
I have no idea why the replacement vector is bigger. Can anyone help me identify the error or propose another way to accomplish the goal?
Thank you all in advance
Alex

Use ave, which uses mean as its default function. See ?ave.
> teste$media <- ave(teste$price, teste$model)
> teste
model price media
1 car 10 12
2 car 11 12
3 car 12 12
4 car 13 12
5 car 14 12
6 bike 5 7
7 bike 6 7
8 bike 7 7
9 bike 8 7
10 bike 9 7
11 truck 12 14
12 truck 13 14
13 truck 14 14
14 truck 15 14
15 truck 16 14

With dplyr:
library(dplyr)
teste %>%
  group_by(model) %>%
  mutate(media = mean(price))
Or with data.table:
library(data.table)
setDT(teste)[, media := mean(price), by = model]
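For completeness, the original loop fails because for iterates over teste$model as character values, so teste$media[x] assigns by name; since media has no names, each assignment like teste$media["car"] appends a new named element, growing the column past 15 entries. A sketch of a corrected loop that indexes by row position instead (data rebuilt here to be self-contained):

```r
# rebuild the example data frame
teste <- data.frame(
  model = rep(c("car", "bike", "truck"), each = 5),
  price = c(10:14, 5:9, 12:16)
)

teste$media <- NA_real_
for (i in seq_len(nrow(teste))) {
  # index by row position, not by the model's character value
  teste$media[i] <- mean(teste$price[teste$model == teste$model[i]])
}
```

Note this recomputes each group mean once per row, whereas ave computes it once per group, so ave is still the better tool.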

Related

Importing .csv file with tidydata

I am having difficulty getting my data from a .csv file into tidy format.
My data set is made up of descriptive data (age, country, etc.) and then 15 condition columns that I would like to gather into a single column (long format). I have tried 'melting' the data a few ways, but it does not turn out the way I intended. Below are a few things I have tried; I know it is kind of messy. There are quite a few NAs in the data, which seem to be causing an issue. I am trying to create a column "Vignette" to serve as the collective column for the 15 vignette columns I would like in long format.
head(dat)
ID Frequency Gender Country Continent Age
1 5129615189 At least weekly female France Europe 30-50 years
2 5128877943 At least daily female Spain Europe > 50 years
3 5126775994 At least weekly female Spain Europe 30-50 years
4 5126598863 At least daily male Albania Europe 30-50 years
5 5124909744 At least daily female Ireland Europe > 50 years
6 5122047758 At least weekly female Denmark Europe 30-50 years
Practice Specialty Seniority AMS
1 University public hospital centre Infectious diseases 6-10 years Yes
2 Other public hospital Infectious diseases > 10 years Yes
3 University public hospital centre Intensive care > 10 years Yes
4 University public hospital centre Infectious diseases > 10 years No
5 Private hospial/clinic Clinical microbiology > 10 years Yes
6 University public hospital centre Infectious diseases 0-5 years Yes
Durations V01 V02 V03 V04 V05 V06 V07 V08 V09 V10 V11 V12 V13 V14 V15
1 range 7 2 7 7 7 5 7 14 7 42 42 90 7 NA 5
2 range 7 10 10 5 14 5 7 14 10 42 21 42 14 14 14
3 range 7 5 5 7 14 5 5 13 10 42 42 42 5 0 7
4 range 10 7 7 5 7 10 7 5 7 28 14 42 10 10 7
5 range 7 5 7 7 14 7 7 14 10 42 42 90 10 0 7
6 fixed duration 7 3 3 7 10 10 7 14 7 90 90 90 10 7 7
dat_long %>%
  gather(Days, Age, -Vignette)

dat$new_sp <- NULL
names(dat) <- gsub("new_sp", "", names(dat))

dat_tidy <- melt(
  data = dat,
  id = 0:180,
  variable.name = "Vignette",
  value.name = "Days",
  na.rm = TRUE
)
dat_tidy <- mutate(dat_tidy,
  Days = sub("^V", "", Days)
)
It keeps saying "Error: id variables not found in data: NA"
I have tried to get rid of NAs but it doesn't seem to do anything.
I am guessing you are loading the melt function from reshape2. I recommend trying tidyr, which is essentially the next generation of reshape2.
Your error presumably comes from the argument id=0:180. This asks melt to keep columns 0-180 as "identifier" columns and melt the rest (i.e. create a new row for each value in each remaining column).
When you subset more column indices than the data.frame has columns, the non-existent columns are filled with NA - you asked for them, so you get them!
I would recommend loading tidyr, as it is newer. There are some newer, more intuitive verbs in the package, but I'll give you a solution with the older semantics:
library(tidyr)
dat_tidy <- dat %>% gather('Vignette', 'Days', starts_with('V'))
# or a bit more verbose
dat_tidy <- dat %>% gather('Vignette', 'Days', V01, V02, V03, V04)
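In current tidyr (1.0 and later), gather is superseded by pivot_longer; a minimal sketch with toy data standing in for dat (only two V columns here, and values_drop_na handles the NAs that were causing trouble):

```r
library(dplyr)
library(tidyr)

# toy stand-in for dat: descriptive columns plus vignette columns V01, V02
dat <- data.frame(
  ID = 1:3,
  Gender = c("female", "male", "female"),
  V01 = c(7, 10, NA),
  V02 = c(2, 7, 5)
)

dat_tidy <- dat %>%
  pivot_longer(
    cols = starts_with("V"),
    names_to = "Vignette",
    values_to = "Days",
    values_drop_na = TRUE   # drop NA cells instead of carrying them along
  )
```

The same call scales to all 15 vignette columns as long as they share the "V" prefix.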
And check out the comment by @heck1 for tips on asking even better questions.

Partial De-duplication in R based on string value match

I have a dataframe named 'reviews' like this:
score_phrase title score release_year release_month release_day
1 Amazing LittleBigPlanet PS Vita 9 2012 9 12
2 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition 9 2012 9 12
3 Great Splice: Tree of Life 8.5 2012 9 12
4 Great NHL 13 8.5 2012 9 11
5 Great NHL 13 8.5 2012 9 11
6 Good Total War Battles: Shogun 7 2012 9 11
7 Awful Double Dragon: Neon 3 2012 9 11
8 Amazing Guild Wars 2 9 2012 9 11
9 Awful Double Dragon: Neon 3 2012 9 11
10 Good Total War Battles: Shogun 7 2012 9 11
Objective: slight mismatches/typos in column values cause duplicated records. Here row 1 and row 2 are duplicates, and row 2 should be dropped by de-duplication.
I used the dedup() function from the 'scrubr' package to perform de-duplication, but on a large dataset I get an incorrect number of duplicates when I adjust the tolerance level for string matching.
For example:
partial_dup_data <- reviews[1:100,] %>% dedup(tolerance = 0.7)
#count w/o duplicates: 90
attr(partial_dup_data, "dups")
# count of identified duplicates: 16
Could somebody suggest what I am doing incorrectly? Is there another approach to achieve the objective?

Performing the colsum based on row values [duplicate]

This question already has answers here:
Calculate the mean by group
(9 answers)
Aggregate / summarize multiple variables per group (e.g. sum, mean)
(10 answers)
Closed 5 years ago.
Hi, I have 3 data sets containing items and counts. I need to combine all the data sets and sum the counts by item name. Here is my input:
Df1 <- data.frame(items =c("Cookies", "Candys","Toys","Games"), Counts = c( 10,20,30,5))
Df2 <- data.frame(items =c( "Candys","Cookies","Toys"), Counts = c( 5,21,20))
Df3 <- data.frame(items =c( "Playdows","Gummies","Candys"), Counts = c(10,15,20))
Df_all <- rbind(Df1,Df2,Df3)
Df_all
items Counts
1 Cookies 10
2 Candys 20
3 Toys 30
4 Games 5
5 Candys 5
6 Cookies 21
7 Toys 20
8 Playdows 10
9 Gummies 15
10 Candys 20
I need to combine rows with the same item value, deleting the extra rows after summing their counts. My output should be:
items Counts
1 Cookies 31
2 Candys 45
3 Toys 50
4 Games 5
5 Playdows 10
6 Gummies 15
Could you help me get this output in R?
Use dplyr:
library(dplyr)
result <- Df_all %>% group_by(items) %>% summarize(sum(Counts))
> result
# A tibble: 6 x 2
items `sum(Counts)`
<fct> <dbl>
1 Candys 45.0
2 Cookies 31.0
3 Games 5.00
4 Toys 50.0
5 Gummies 15.0
6 Playdows 10.0
You can use tapply
tapply(Df_all$Counts, Df_all$items, FUN=sum)
which returns
Candys Cookies Games Toys Gummies Playdows
45 31 5 50 15 10
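If you would rather get a data.frame back from base R (tapply returns a named vector), aggregate does the same grouped sum; a sketch using the data from the question:

```r
Df1 <- data.frame(items = c("Cookies", "Candys", "Toys", "Games"),
                  Counts = c(10, 20, 30, 5))
Df2 <- data.frame(items = c("Candys", "Cookies", "Toys"),
                  Counts = c(5, 21, 20))
Df3 <- data.frame(items = c("Playdows", "Gummies", "Candys"),
                  Counts = c(10, 15, 20))
Df_all <- rbind(Df1, Df2, Df3)

# formula interface: sum Counts within each level of items
result <- aggregate(Counts ~ items, data = Df_all, FUN = sum)
```

The result keeps the two-column items/Counts shape of the desired output, though sorted by item name rather than first appearance.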

dcast - concatenate column values with column names [duplicate]

This question already has answers here:
how to spread or cast multiple values in r [duplicate]
(2 answers)
Closed 7 years ago.
I would like to concatenate column values with column names to create new columns. I am experimenting with library(reshape2) and dcast, but I can't get the required output.
Is there a method that doesn't involve performing dcast multiple times then merging the resulting sets back together?
Current data frame:
observation=c(1,1,1,2,2,2,3,3,3)
event=c('event1','event2','event3','event1','event2','event3','event1','event2','event3')
value1=c(1,2,3,4,5,6,7,8,9)
value2=c(11,12,13,14,15,16,17,18,19)
current=data.frame(observation,event,value1,value2)
current
Required data frame:
observation=c(1,2,3)
event1_value1 =c(1,4,7)
event2_value1 =c(2,5,8)
event3_value1 =c(3,6,9)
event1_value2 =c(11,14,17)
event2_value2 =c(12,15,18)
event3_value2 =c(13,16,19)
required=data.frame(observation,event1_value1,event2_value1,event3_value1,event1_value2,event2_value2,event3_value2)
required
The method below works but I feel there must be a quicker way!
library(reshape2)
value1 <- dcast(current,observation~event,value.var ="value1")
value2 <- dcast(current,observation~event,value.var ="value2")
merge(value1,value2,by="observation",suffixes = c("_value1","_value2"))
This is an extension of reshape from long to wide
You can use the devel version of data.table, i.e. v1.9.5, which can take multiple value.var columns. Instructions to install the devel version are here.
library(data.table)#v1.9.5+
dcast(setDT(current), observation~event, value.var=c('value1', 'value2'))
# observation event1_value1 event2_value1 event3_value1 event1_value2
#1: 1 1 2 3 11
#2: 2 4 5 6 14
#3: 3 7 8 9 17
# event2_value2 event3_value2
#1: 12 13
#2: 15 16
#3: 18 19
Or reshape from base R
reshape(current, idvar='observation', timevar='event', direction='wide')
# observation value1.event1 value2.event1 value1.event2 value2.event2
#1 1 1 11 2 12
#4 2 4 14 5 15
#7 3 7 17 8 18
# value1.event3 value2.event3
#1 3 13
#4 6 16
#7 9 19
I'm not sure about the efficiency, but you could try this:
> dcast(melt(current,id.vars = c('observation','event')),observation~event+variable)
observation event1_value1 event1_value2 event2_value1 event2_value2 event3_value1 event3_value2
1 1 1 11 2 12 3 13
2 2 4 14 5 15 6 16
3 3 7 17 8 18 9 19
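For readers on current tidyr (1.0 and later), pivot_wider also takes multiple value columns directly, and names_glue reproduces the exact event1_value1 naming the question asks for; a sketch using the question's data:

```r
library(tidyr)

observation <- c(1, 1, 1, 2, 2, 2, 3, 3, 3)
event <- rep(c("event1", "event2", "event3"), times = 3)
value1 <- 1:9
value2 <- 11:19
current <- data.frame(observation, event, value1, value2)

wide <- pivot_wider(
  current,
  names_from = event,
  values_from = c(value1, value2),
  names_glue = "{event}_{.value}"   # e.g. "event1_value1"
)
```

This is the single-call equivalent of the dcast-twice-then-merge approach in the question.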

How to calculate top rows from a large data set

I have a dataset with the following columns: flavor, flavorid and unitsold.
Flavor Flavorid unitsold
beans 350 6
creamy 460 2
.
.
.
I want to find the top ten flavors and then calculate the market share for each. My logic is: market share for a flavor = units sold of that flavor divided by total units sold.
How do I implement this? For output I just want two columns, Flavorid and the corresponding market share. Do I need to save the top ten flavors in some table first?
One way is with the dplyr package:
An example data set:
flavor <- rep(letters[1:15],each=5)
flavorid <- rep(1:15,each=5)
unitsold <- 1:75
df <- data.frame(flavor,flavorid,unitsold)
> df
flavor flavorid unitsold
1 a 1 1
2 a 1 2
3 a 1 3
4 a 1 4
5 a 1 5
6 b 2 6
7 b 2 7
8 b 2 8
9 b 2 9
...
...
Solution:
library(dplyr)
df %>%
  select(flavorid, unitsold) %>%               # select the columns you want
  group_by(flavorid) %>%                       # group by flavorid
  summarise(total = sum(unitsold)) %>%         # sum the units sold per id
  mutate(marketshare = total / sum(total)) %>% # calculate the market share per id
  arrange(desc(marketshare)) %>%               # order by marketshare, descending
  head(10)                                     # pick the first 10
# add another select(flavorid, marketshare) if you only want those two columns
Output:
Source: local data frame [10 x 3]
flavorid total marketshare
1 15 365 0.12807018
2 14 340 0.11929825
3 13 315 0.11052632
4 12 290 0.10175439
5 11 265 0.09298246
6 10 240 0.08421053
7 9 215 0.07543860
8 8 190 0.06666667
9 7 165 0.05789474
10 6 140 0.04912281
