Wrong histogram from data - r

I have the data frame new1 with 20 columns of variables one of which is new1$year. This includes 25 years with the following count:
> table(new1$year)
1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
2770 3171 3392 2955 2906 2801 2930 2985 3181 3059 2977 2884 3039 2428 2653 2522 2558 2370 2666 3046 3155 3047 2941 2591 1580
I tried to prepare a histogram of this with
hist(new1$year, breaks=25)
but I obtain a histogram where the height of the bars differs from the counts in table(new1$year). For example, the first bar is >4000 in the histogram while it should be 2770; another example is 1995, where the bar should be lower than those of the years around it but is instead a little higher.
What am I doing wrong? I have also tried numeric(new1$year), which fails with the error 'invalid length argument', so the result is no different.
Many thanks
Marco

Per my comment, try:
barplot(table(new1$year))
The reason hist does not work as you intend has to do with how the breaks argument is specified. See ?hist:
breaks: one of:
a vector giving the breakpoints between histogram cells,
a function to compute the vector of breakpoints,
a single number giving the number of cells for the histogram,
a character string naming an algorithm to compute the number of cells (see ‘Details’),
a function to compute the number of cells.
In the last three cases the number is a suggestion only.
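As an illustration of the point about breaks, here is a minimal sketch on simulated data (the vector year below is a made-up stand-in for new1$year, and year_breaks is a name chosen for this example). With whole-number years, a single number such as breaks=25 is only a suggestion, and breakpoints that land exactly on the years can pool two neighbouring years into one cell. Passing explicit breakpoints at the half-years is one way to get exactly one cell per year, so that the bar heights agree with table():

```r
# Simulated stand-in for new1$year (assumption: whole-number years 1988-2012)
set.seed(1)
year <- sample(1988:2012, 5000, replace = TRUE)

# One cell per year: breakpoints at 1987.5, 1988.5, ..., 2012.5
year_breaks <- seq(min(year) - 0.5, max(year) + 0.5, by = 1)
h <- hist(year, breaks = year_breaks, plot = FALSE)

# Counts per year, keeping a zero for any year that happens to be absent
year_counts <- as.vector(table(factor(year, levels = min(year):max(year))))

# The histogram cells now agree with the table
all(h$counts == year_counts)
```

Dropping plot = FALSE draws the histogram. When the variable is really a category, as years are here, barplot(table(year)) remains the simpler route.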

Related

Add columns to other columns

I would like to take two columns and append them to two other columns. For example, I have the data below:
EU.Member.States  X.   Other.countries..continued.  X..1
Austria           122  Cameroon                     203
Belgium           150  Canada                       156
Denmark           179  Canary Islands               132
Finland           156  Cape Verde                   147
France            130  Cayman Islands               213
How can I take the rows under "Other.countries..continued." and "X..1" and add them directly under "EU.Member.States" and "X." respectively?
I have tried using unite() from tidyr, with no success.
Your question is almost identical to this one. Using the pipe from the dplyr package, I can suggest a solution that first duplicates your column names and then applies a classic rbind. I used only the first 2 lines of your example:
df %>% setNames(names(df)[c(1,2,1,2)]) %>% {rbind(.[,1:2], .[,3:4])}
#### EU.Member.States X.
#### 1 Austria 122
#### 2 Belgium 150
#### 3 Cameroon 203
#### 4 Canada 156
Note: the curly braces are there to tell the pipe not to pass the . as an implicit first argument.
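For comparison, the same stacking can be sketched in base R without the pipe. This is only an illustration: the data frame df is rebuilt here from the first two rows of the example, and stacked is a name chosen for this sketch.

```r
# First two rows of the example data (rebuilt by hand for this sketch)
df <- data.frame(
  EU.Member.States            = c("Austria", "Belgium"),
  X.                          = c(122, 150),
  Other.countries..continued. = c("Cameroon", "Canada"),
  X..1                        = c(203, 156),
  stringsAsFactors = FALSE
)

# Rename columns 3:4 to match columns 1:2, then stack the two halves
stacked <- rbind(df[, 1:2], setNames(df[, 3:4], names(df)[1:2]))
stacked
```

rbind() on data frames matches columns by name, which is why the second pair has to be renamed before stacking.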

Formatting data to display multiple boxplots in R, also creating a double y-axis in R

I have a data set that has observations of two variables (incidence (0-100) and severity (0-5)) across 5 years. It looks something like this.
cbb.incidence avg.severity Year
1 86.666667 2.0333333 2009
2 83.333333 1.8666667 2009
3 20.000000 1.2000000 2009
4 26.666667 1.2666667 2010
5 86.666667 1.9000000 2010
6 86.666667 1.8666667 2010
7 86.666667 2.0333333 2011
8 83.333333 1.8666667 2011
9 20.000000 1.2000000 2012
10 26.666667 1.2666667 2012
11 86.666667 1.9000000 2013
12 86.666667 1.8666667 2013
What I want to get is a figure with two box-plots for each year, one for each variable. I found my exact question here on Stack Overflow: Plot multiple boxplot in one graph
So I "melt" the data as they describe in the example, and then plot it as described:
meltedData<-melt(incidence_all, id.var='Year')
ggplot(data=meltedData, aes(x=Year, y=value)) +
geom_boxplot(aes(fill=variable))
The data appears to be in the correct format
The melted data looks like this (this is a subset, there are >2000 rows):
Year variable value
1017 2009 avg.severity 1.5333333
1018 2009 avg.severity 2.1333333
1019 2009 avg.severity 2.0666667
1020 2009 avg.severity 2.0000000
1021 2009 avg.severity 2.0666667
1022 2009 avg.severity 1.6333333
1023 2009 avg.severity 1.5666667
1024 2009 cbb.incidence 16.777775
1025 2009 cbb.incidence 35.888865
Will one of you R wizards please tell me what I'm doing wrong?
ALSO, I already know that my two variables are on very different scales (incidence is from 0-100, and severity is 1-5), so if I simply plot both with the same y-axis scale the smaller values will be unreadable. I would like to have a double y-axis, one on the left and one on the right, with each variable scaled to a different y-axis. I have not seen a box-plot example with this feature. Can someone recommend how to approach this, preferably in ggplot?
THANK YOU!!
Try converting Year to a factor first:
library(reshape2)  # assumed source of melt()
library(ggplot2)
incidence_all$Year <- factor(incidence_all$Year)
meltedData <- melt(incidence_all, id.var = "Year")
ggplot(data = meltedData, aes(x = Year, y = value)) +
  geom_boxplot(aes(fill = variable))
You will get something like this:
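For the second question, one alternative would be to rescale severity so that its maximum is 100, like incidence:
incidence_all$avg.severitys <- incidence_all$avg.severity * 100 / max(incidence_all$avg.severity)
# drop the original avg.severity (column 2) so only the rescaled copy is melted
meltedData <- melt(incidence_all[, -2], id.var = "Year")
ggplot(data = meltedData, aes(x = Year, y = value)) +
  geom_boxplot(aes(fill = variable))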
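Building on the rescaling idea, a right-hand axis can be labelled in the original severity units with ggplot2's sec_axis(). The following is only a sketch under stated assumptions: the six rows are rounded from the sample data, scale_factor assumes severity tops out at 5 and incidence at 100, the long-format data frame is built by hand instead of with melt(), and the plotting part runs only if ggplot2 is installed.

```r
# Small stand-in for incidence_all (values rounded from the sample rows)
incidence_all <- data.frame(
  cbb.incidence = c(86.7, 83.3, 20.0, 26.7, 86.7, 86.7),
  avg.severity  = c(2.03, 1.87, 1.20, 1.27, 1.90, 1.87),
  Year          = factor(c(2009, 2009, 2009, 2010, 2010, 2010))
)

# Assumption: severity runs up to 5 and incidence up to 100
scale_factor <- 100 / 5

# Long format built by hand, with severity stretched onto the incidence scale
long <- rbind(
  data.frame(Year = incidence_all$Year, variable = "cbb.incidence",
             value = incidence_all$cbb.incidence),
  data.frame(Year = incidence_all$Year, variable = "avg.severity",
             value = incidence_all$avg.severity * scale_factor)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(long, aes(x = Year, y = value, fill = variable)) +
    geom_boxplot() +
    scale_y_continuous(
      name = "incidence (0-100)",
      # the right axis undoes the stretch, so it reads in severity units
      sec.axis = sec_axis(~ . / scale_factor, name = "severity (0-5)")
    )
  print(p)
}
```

The secondary axis is purely a relabelling of the primary one, so both variables still share one drawing scale; only the tick labels on the right differ.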

Fuzzy string matching in r

I have 2 datasets with more than 100K rows each. I would like to merge them by fuzzy string matching on one column ('movie title') combined with the release date. I am providing a sample from both datasets below.
dataset-1
itemid userid rating time title release_date
99991 1673 835 3 1998-03-27 mirage 1995
99992 1674 840 4 1998-03-29 mamma roma 1962
99993 1675 851 3 1998-01-08 sunchaser, the 1996
99994 1676 851 2 1997-10-01 war at home, the 1996
99995 1677 854 3 1997-12-22 sweet nothing 1995
99996 1678 863 1 1998-03-07 mat' i syn 1997
99997 1679 863 3 1998-03-07 b. monkey 1998
99998 1680 863 2 1998-03-07 sliding doors 1998
99999 1681 896 3 1998-02-11 you so crazy 1994
100000 1682 916 3 1997-11-29 scream of stone (schrei aus stein) 1991
dataset - 2
itemid userid rating time title release_date
1 2844 4477 3 2013-03-09 fantã´mas - 〠l'ombre de la guillotine 1913
2 4936 8871 4 2013-05-05 the bank 1915
3 4936 11628 3 2013-07-06 the bank 1915
4 4972 16885 4 2013-08-19 the birth of a nation 1915
5 5078 11628 2 2013-08-23 the cheat 1915
6 6684 4222 3 2013-08-24 the fireman 1916
7 6689 4222 3 2013-08-24 the floorwalker 1916
8 7264 2092 4 2013-03-17 the rink 1916
9 7264 5943 3 2013-05-12 the rink 1916
10 7880 11628 4 2013-07-19 easy street 1917
I have looked at 'agrep' but it only matches one string at a time. The 'stringdist' function is good, but you need to run it in a loop, find the minimum distance and then go on to further processing, which is very time consuming given the size of the datasets. The strings can have typos and special characters, which is why fuzzy matching is required. I have looked around and found the 'Levenshtein' and 'Jaro-Winkler' methods. The latter, I read, is good when you have typos in strings.
In this scenario, only fuzzy matching may not provide good results e.g., A movie title 'toy story' in one dataset can be matched to 'toy story 2' in the other which is not right. So I need to consider the release date to make sure the movies that are matched are unique.
I want to know if there is a way to achieve this task without using a loop. In the worst case, if I have to use a loop, how can I make it work efficiently and as fast as possible?
I have tried the following code but it has taken an awful amount of time to process.
for(i in 1:nrow(test))
for(j in 1:nrow(test1))
{
test$title.match <- ifelse(jarowinkler(test$x[i], test1$x[j]) > 0.85,
test$title, NA)
}
test - contains 1682 unique movie names converted to lower case
test1 - contains 11451 unique movie names converted to lower case
Is there a way to avoid the for loops and make it work faster?
What about this approach to move you forward? You can adjust the degree of match from 0.85 after you see the results. You could then use dplyr to group by the matched title and summarise by subtracting release dates. Any zeros would mean the same release date.
dataset_1$title.match <- ifelse(jarowinkler(dataset_1$title, dataset_2$title) > 0.85, dataset_1$title, NA)
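To compare every title in one dataset against every title in the other without an explicit loop, one option is a distance matrix. The sketch below swaps Jaro-Winkler for base R's adist() (generalized Levenshtein distance) so that it runs without extra packages, turns the distance into a rough 0-1 similarity, and rules out pairs whose release years differ. The data frames d1 and d2, their contents and the 0.85 cutoff are all made up for illustration.

```r
# Toy stand-ins for the two datasets (titles and years invented for the sketch)
d1 <- data.frame(title = c("toy story", "sliding doors", "mamma roma"),
                 release_date = c(1995, 1998, 1962),
                 stringsAsFactors = FALSE)
d2 <- data.frame(title = c("toy story 2", "toy storey", "slidng doors", "the rink"),
                 release_date = c(1999, 1995, 1998, 1916),
                 stringsAsFactors = FALSE)

# All-pairs edit distances: rows are d1 titles, columns are d2 titles
dist <- adist(d1$title, d2$title)

# Rough similarity in [0, 1]: 1 minus distance over the longer title's length
longest <- outer(nchar(d1$title), nchar(d2$title), pmax)
sim <- 1 - dist / longest

# Pairs with different release years can never match
sim[outer(d1$release_date, d2$release_date, "!=")] <- 0

# Best candidate in d2 for each d1 title, kept only above the cutoff
best <- apply(sim, 1, which.max)
best_sim <- sim[cbind(seq_len(nrow(d1)), best)]
d1$title.match <- ifelse(best_sim > 0.85, d2$title[best], NA)
d1
```

Here 'toy story' is paired with 'toy storey' (same year) and not with 'toy story 2'. A full matrix for two 100K-row tables would be far too large to hold in memory, so in practice one would probably split both tables by release year first and build one small matrix per year.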

Creating lag variables for matched factors

I have a question about creating lag variables depending on a time factor.
Basically I am working with a baseball dataset where each player has many rows, one per season, between 2002 and 2012. I only want lag variables within the same player, to build a career arc that predicts the current stat. For example, I want to use lag 1 Average (2003) and lag 2 Average (2004) to try to predict the current average in 2005. So I tried to write a loop that goes through every row (the data frame is already sorted by name and then year, so the previous year is row n-1), checks if the name is the same, and if so grabs the value from the previous row.
Here is my loop:
i=2 #as 1 errors out with 1-0 row
for(i in 2:6264){
if(TS$name[i]==TS$name[i-1]){
TS$runvalueL1[i]=TS$Run_Value[i-1]
}else{
TS$runvalueL1 <- NA
}
i=i+1
}
Because each row is dependent on the name I cannot use most of the lag functions. If you have a better idea I am all ears!
Sample Data won't help a bunch but here is some:
edit: Sample data wasn't producing useable results so I just attached the first 10 people of my dataset. Thanks!
TS[(6:10),c('name','Season','Run_Value')]
name Season ARuns
321 Abad Andy 2003 -1.05
3158 Abercrombie Reggie 2006 27.42
1312 Abercrombie Reggie 2007 7.65
1069 Abercrombie Reggie 2008 5.34
4614 Abernathy Brent 2002 46.71
707 Abernathy Brent 2003 -2.29
1297 Abernathy Brent 2005 5.59
6024 Abreu Bobby 2002 102.89
6087 Abreu Bobby 2003 113.23
6177 Abreu Bobby 2004 128.60
Thank you!
Something along these lines should do it:
names = c("Adams","Adams","Adams","Adams","Bobby","Bobby", "Charlie")
years = c(2002,2003,2004,2005,2004,2005,2010)
Run_value = c(10,15,15,20,10,5,5)
library(data.table)
dt = data.table(names, years, Run_value)
dt[, lag1 := c(NA, Run_value[-.N]), by = names]  # drop each group's last value so the lengths match
# names years Run_value lag1
#1: Adams 2002 10 NA
#2: Adams 2003 15 10
#3: Adams 2004 15 15
#4: Adams 2005 20 15
#5: Bobby 2004 10 NA
#6: Bobby 2005 5 10
#7: Charlie 2010 5 NA
An alternative would be to split the data by name, use lapply with the lag function of your choice and then combine the split data again:
TS$runvalueL1 <- do.call("rbind", lapply(split(TS, list(TS$name)), your_lag_function))
or
TS$runvalueL1 <- do.call("c", lapply(split(TS, list(TS$name)), your_lag_function))
I guess there is also a nice way to do this with plyr, but as you did not provide a reproducible example, this will have to do as a start.
Better:
TS$runvalueL1 <- unlist(lapply(split(TS, list(TS$name)), your_lag_function))
This is obviously not a problem where you want to create a matrix with cbind, so this is a better data structure:
full=data.frame(names, years, Run_value)
The ave function is quite useful for constructing new columns within categories of other columns:
full$Lag1 <- ave(full$Run_value, full$names,
FUN= function(x) c(NA, x[-length(x)] ) )
full
names years Run_value Lag1
1 Adams 2002 10 NA
2 Adams 2003 15 10
3 Adams 2004 15 15
4 Adams 2005 20 15
5 Bobby 2004 10 NA
6 Bobby 2005 5 10
7 Charlie 2010 5 NA
I think it's safer to construct with NA, since that will help prevent errors in logic that using 0 for prior years in year 1 would not alert you to.
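Since the question also asks for a second lag, the ave approach can be generalised to any lag k. This is a minimal sketch: lag_by_group is a hypothetical helper written for this example, not an existing function, and the data are the toy values used above.

```r
# Toy data from the answers above
full <- data.frame(
  names     = c("Adams", "Adams", "Adams", "Adams", "Bobby", "Bobby", "Charlie"),
  years     = c(2002, 2003, 2004, 2005, 2004, 2005, 2010),
  Run_value = c(10, 15, 15, 20, 10, 5, 5)
)

# Hypothetical helper: shift x down by k rows within each group g, padding with NA
lag_by_group <- function(x, g, k = 1) {
  ave(x, g, FUN = function(v) c(rep(NA, min(k, length(v))), head(v, -k)))
}

full$Lag1 <- lag_by_group(full$Run_value, full$names, k = 1)
full$Lag2 <- lag_by_group(full$Run_value, full$names, k = 2)
full
```

Note that this lags by row position within each player, not by calendar year. If a player skips a season (as Abernathy Brent does between 2003 and 2005 in the sample), the previous row is not the previous year, so the rows should be sorted by season and any gaps handled separately.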

How to take the mean of last 10 values in a column before a missing value using R?

I am new to R and having trouble figuring out how to go about this. I have data on tree growth rates from dead trees, organized by year. So, my first column is year and the columns to the right are growth rates for individual trees, ending in the year each tree died. After the tree died, the values are "NA" for the remaining years in the dataset. I need to take the mean growth for the 10 years preceding each tree's death, but each tree died in a different year. Does anyone have an idea for how to do this? Here is an example of what a dataset might look like:
Year Tree1 Tree2 Tree3
1989 53.00 84.58 102.52
1990 63.68 133.16 146.07
1991 90.37 103.10 233.58
1992 149.24 127.61 245.69
1993 96.20 54.78 417.96
1994 230.64 60.92 125.31
1995 150.81 60.98 100.43
1996 124.25 42.73 75.43
1997 173.42 67.20 50.34
1998 119.60 73.40 32.43
1999 179.97 61.24 NA
2000 114.88 67.43 NA
2001 82.23 55.23 NA
2002 49.40 NA NA
2003 93.46 NA NA
2004 104.67 NA NA
2005 44.14 NA NA
2006 88.40 NA NA
So, the averages I need to calculate are:
Tree1: mean(1997-2006) = 105.01
Tree2: mean(1992-2001) = 67.15
Tree3: mean(1989-1998) = 152.98
Since I need to do this for a large number of trees, it would be helpful to have a method of automating the calculation. Thank you very much for any help! Katie
You can use sapply and tail together with na.omit as follows:
sapply(mydf[-1], function(x) mean(tail(na.omit(x), 10)))
# Tree1 Tree2 Tree3
# 105.017 67.152 152.976
mydf[-1] says to drop the first column. tail has an argument, n, that lets you specify how many values you want from the end (tail) of your data. Here, we've set it to 10 since you want the last 10 values. As long as there are no NA values in your actual data for the years while the trees are alive, you can safely use na.omit on your data.
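If there might be NA values in years while a tree was still alive, na.omit would quietly reach further back than 10 calendar years. A variant that takes the 10 rows ending at the last recorded value is sketched below; mean_last_n is a hypothetical helper name, and mydf is rebuilt here from the example table so the snippet runs on its own.

```r
# Example data from the question
mydf <- data.frame(
  Year  = 1989:2006,
  Tree1 = c(53.00, 63.68, 90.37, 149.24, 96.20, 230.64, 150.81, 124.25, 173.42,
            119.60, 179.97, 114.88, 82.23, 49.40, 93.46, 104.67, 44.14, 88.40),
  Tree2 = c(84.58, 133.16, 103.10, 127.61, 54.78, 60.92, 60.98, 42.73, 67.20,
            73.40, 61.24, 67.43, 55.23, NA, NA, NA, NA, NA),
  Tree3 = c(102.52, 146.07, 233.58, 245.69, 417.96, 125.31, 100.43, 75.43,
            50.34, 32.43, NA, NA, NA, NA, NA, NA, NA, NA)
)

# Hypothetical helper: mean over the n rows ending at the last non-NA value
mean_last_n <- function(x, n = 10) {
  last <- max(which(!is.na(x)))              # row of the final recorded year
  window <- x[max(1, last - n + 1):last]     # the n rows up to and including it
  mean(window, na.rm = TRUE)                 # gaps inside the window are skipped
}

res <- sapply(mydf[-1], mean_last_n)
res
```

On the example data this gives the same three means as the tail/na.omit version, since none of the trees has a gap before its death.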
