I was trying to mutate a new numeric column in a data frame, but R is treating it as character and I am not even able to access it by index - r

library(dslabs)
data(heights)
library(dplyr)
mutate(heights, ht_cm = height * 2.54, stringsAsFactor = FALSE )
str(heights) # not showing ht_cm as a variable in the data frame
mean(heights$ht_cm) # giving error that argument is not numeric

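You called mutate but did not assign its result, so heights itself was never changed. (stringsAsFactor = FALSE is also not a mutate option; mutate treats it as one more new column.) To add the new column to heights you need to: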
Code
heights <-
heights %>%
mutate(ht_cm = height * 2.54)
Output
str(heights)
'data.frame': 1050 obs. of 3 variables:
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
$ height: num 75 70 68 74 61 65 66 62 66 67 ...
$ ht_cm : num 190 178 173 188 155 ...
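The underlying point is that functions like mutate return a modified copy and leave the original untouched. A minimal base R sketch of the same behaviour, using transform as a stand-in for mutate and a small made-up data frame (toy and its values are assumptions, not the dslabs data):

```r
# Made-up stand-in for the heights data
toy <- data.frame(sex = c("Male", "Female"), height = c(70, 65))

# transform() returns a new data frame; without assignment, toy is unchanged
transform(toy, ht_cm = height * 2.54)
has_col_before <- "ht_cm" %in% names(toy)

# Assigning the result back keeps the new column
toy <- transform(toy, ht_cm = height * 2.54)
has_col_after <- "ht_cm" %in% names(toy)

is.numeric(toy$ht_cm)
mean(toy$ht_cm)
```

Here has_col_before is FALSE and has_col_after is TRUE, which mirrors what happened with str(heights) in the question.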

Related

if statement with three outcomes

I'd like to make a new column in which the value depends on other columns.
There are three possible outcomes
Distance < Min_disp = 0
Distance < Max_disp = Distance
Distance > Max_disp = Max_disp
I have tried using an if-statement, with multiple outcomes, but receive a warning.
Warning messages:
1: In if (Noord_2015_moved$Distance < Noord_2015_moved$Min_disp) { :
the condition has length > 1 and only the first element will be used
2: In if (Noord_2015_moved$Distance < Noord_2015_moved$Max_disp) { :
the condition has length > 1 and only the first element will be used
And indeed it only prints "Max_disp".
This is the code I've used
if (Noord_2015_moved$Distance < Noord_2015_moved$Min_disp) {
0
} else if (Noord_2015_moved$Distance < Noord_2015_moved$Max_disp) {
Noord_2015_moved$Distance
} else {
Noord_2015_moved$Max_disp
}
I have also tried running it in three separate steps, but then I run into the problem that I don't know how to tell R to only apply part of the df$column, because now I get the error
number of items to replace is not a multiple of replacement length
Noord_2015_moved <- mutate(Noord_2015_moved, Actual_disp = ifelse(Distance < Min_disp, 0, NA))
Noord_2015_moved$Actual_disp[Noord_2015_moved$Distance < Noord_2015_moved$Max_disp] <- Noord_2015_moved$Distance
Noord_2015_moved$Actual_disp[is.na(Noord_2015$Actual_disp)] <- Noord_2015_moved$Max_disp
And this is my data
'data.frame': 301 obs. of 15 variables:
$ Transmitter: Factor w/ 18 levels "A69-1601-22313",..: 1 1 1 1 1 1 1 2 2 2 ...
$ Date : Date, format: "2015-03-03" "2015-03-08" "2015-03-11" "2015-05-18" ...
$ Date_time : Factor w/ 279544 levels "1-03-15 0:00",..: 198302 258702 18684 85140 190788 182641 208718 26315 198759 205744 ...
$ Receiver : Factor w/ 17 levels "uitzetpunt 1-noord",..: 8 5 8 5 6 7 6 8 5 8 ...
$ Station : Factor w/ 17 levels "10","11","12",..: 15 12 15 12 13 14 13 15 12 15 ...
$ Traject : Factor w/ 53 levels "","10-10","10-9",..: 53 50 41 50 40 44 45 53 50 41 ...
$ Interval : num 83.4 12.7 42.6 25.2 217.4 ...
$ Distance : num 1540 6480 6480 6480 4690 4220 4220 1540 6480 6480 ...
$ Min_speed : num 0.02 0.51 0.15 0.26 0.02 0.73 0.52 0.01 0.02 0.02 ...
$ Min_speed2 : num 0.00556 0.14167 0.04167 0.07222 0.00556 ...
$ Length : int 47 47 47 47 47 47 47 45 45 45 ...
$ Activity : chr "Low" "Low" "Low" "Low" ...
$ Moved : chr "Yes" "Yes" "Yes" "Yes" ...
$ Min_disp : num 160 4080 1200 2080 160 5840 4160 80 160 160 ...
$ Max_disp : num 240 6120 1800 3120 240 8760 6240 120 240 240 ...
if() isn't vectorized. It works on a single condition, not a whole vector. That's what the warning "the condition has length > 1 and only the first element will be used" is telling you. You could use if() for this purpose, but you would need to put it in a for loop to check each row one at a time. Doable, but not efficient.
ifelse is a vectorized version of if, and is good for a problem like this. For something like this, you would probably nest 2 ifelses:
Noord_2015_moved$Actual_disp = ifelse(
Noord_2015_moved$Distance < Noord_2015_moved$Min_disp, 0,
ifelse(Noord_2015_moved$Distance < Noord_2015_moved$Max_disp, Noord_2015_moved$Distance,
Noord_2015_moved$Max_disp
))
I see you have a single mutate. If you're using dplyr, you can use mutate, which adds a column to the data frame and lets you reference existing columns without typing out the data frame's name. This code is equivalent to my code above:
Noord_2015_moved = Noord_2015_moved %>% mutate(
Actual_disp = ifelse(Distance < Min_disp, 0,
ifelse(Distance < Max_disp, Distance, Max_disp)
)
)
Instead of nesting ifelse calls, you can use dplyr::case_when, which handles multiple outcomes more cleanly:
Noord_2015_moved = Noord_2015_moved %>% mutate(
Actual_disp = case_when(
Distance < Min_disp ~ 0,
Distance < Max_disp ~ Distance,
Distance >= Max_disp ~ Max_disp,
TRUE ~ NA_real_
)
)
Here is a short reference.
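To see how the vectorized version behaves, here is a base R sketch on made-up values (toy and its numbers are assumptions, not the Noord_2015_moved data). It also compares the nested ifelse with a formulation based on pmin, which is a swapped-in alternative and not part of the answer above.

```r
# Made-up rows covering the three outcomes
toy <- data.frame(Distance = c(100, 500, 900),
                  Min_disp = c(200, 200, 200),
                  Max_disp = c(800, 800, 800))

# Nested ifelse: both conditions are evaluated over the whole columns
toy$Actual_disp <- with(toy, ifelse(Distance < Min_disp, 0,
                             ifelse(Distance < Max_disp, Distance, Max_disp)))

# Alternative: cap Distance at Max_disp with pmin, then zero out short moves
alt <- with(toy, ifelse(Distance < Min_disp, 0, pmin(Distance, Max_disp)))

toy$Actual_disp                      # 0 500 800
isTRUE(all.equal(toy$Actual_disp, alt))
```

The pmin form avoids the second ifelse, at the cost of being a little less explicit about the three cases.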

Measuring distance between centroids R

I want to create a matrix of the distance (in metres) between the centroids of every country in the world. Country names or country IDs should be included in the matrix.
The matrix is based on a shapefile of the world downloaded here: http://gadm.org/version2
Here is some rough info on the shapefile I'm using (I'm using shapefile@data$UN as my ID):
> str(shapefile@data)
'data.frame': 174 obs. of 11 variables:
$ FIPS : Factor w/ 243 levels "AA","AC","AE",..: 5 6 7 8 10 12 13
$ ISO2 : Factor w/ 246 levels "AD","AE","AF",..: 61 17 6 7 9 11 14
$ ISO3 : Factor w/ 246 levels "ABW","AFG","AGO",..: 64 18 6 11 3 10
$ UN : int 12 31 8 51 24 32 36 48 50 84 ...
$ NAME : Factor w/ 246 levels "Afghanistan",..: 3 15 2 11 6 10 13
$ AREA : int 238174 8260 2740 2820 124670 273669 768230 71 13017
$ POP2005 : int 32854159 8352021 3153731 3017661 16095214 38747148
$ REGION : int 2 142 150 142 2 19 9 142 142 19 ...
$ SUBREGION: int 15 145 39 145 17 5 53 145 34 13 ...
$ LON : num 2.63 47.4 20.07 44.56 17.54 ...
$ LAT : num 28.2 40.4 41.1 40.5 -12.3 ...
I tried this:
library(rgeos)
shapefile <- readOGR("./Map/Shapefiles/World/World Map", layer = "TM_WORLD_BORDERS-0.3") # Read in world shapefile
row.names(shapefile) <- as.character(shapefile@data$UN)
centroids <- gCentroid(shapefile, byid = TRUE, id = as.character(shapefile@data$UN)) # create centroids
dist_matrix <- as.data.frame(geosphere::distm(centroids))
The result looks something like this:
V1 V2 V3 V4
1 0.0 4296620.6 2145659.7 4077948.2
2 4296620.6 0.0 2309537.4 219442.4
3 2145659.7 2309537.4 0.0 2094277.3
4 4077948.2 219442.4 2094277.3 0.0
1) Instead of the first column (1,2,3,4) and row (V1, V2, V3, V4) I would like to have country IDs (shapefile@data$UN) or names (shapefile@data$NAME). How does that work?
2) I'm not sure of the value that is returned. Is it metres, kilometres, etc?
3) Is geosphere::distm preferable to geosphere::distGeo in this instance?
1.
This should add the column and row names to your matrix, just as you did when setting the row names of shapefile:
crnames<-as.character(shapefile@data$UN)
colnames(dist_matrix)<- crnames
rownames(dist_matrix)<- crnames
2.
The default distance function in distm is distHaversine, which takes the radius of the earth as an argument, in metres. So I assume the output is in metres.
3.
Look at the documentation for distGeo and distHaversine and decide the level of accuracy you want in your results. To look at the docs in R itself just enter ?distGeo.
edit: answer to q1 may be wrong since the matrix data may be aggregated, looking at alternatives
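To make the units and the labelling concrete, here is a base R sketch that does not use geosphere at all: a hand-written haversine function whose output is in whatever unit the radius r is given in. The function name haversine_m is hypothetical, the radius 6378137 m is an assumption chosen to match the value I understand geosphere to use by default, and the three IDs and coordinates are the first rows of the UN, LON and LAT columns shown in the question.

```r
# Haversine great-circle distance; the result is in the same unit as r
haversine_m <- function(lon1, lat1, lon2, lat2, r = 6378137) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# First three rows of UN, LON and LAT from the str() output above
ids <- c("12", "31", "8")
lon <- c(2.63, 47.40, 20.07)
lat <- c(28.2, 40.4, 41.1)

# Pairwise distances, then label rows and columns with the IDs
idx <- seq_along(ids)
dist_matrix <- outer(idx, idx, function(i, j) {
  haversine_m(lon[i], lat[i], lon[j], lat[j])
})
dimnames(dist_matrix) <- list(ids, ids)

dist_matrix["12", "31"]   # distance in metres between IDs 12 and 31
```

Because the labels come from the same vector that supplied the coordinates, the names line up with the rows only as long as the centroids are in the same order as the IDs.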

Observations becoming NA when ordering levels of factors in R with ordered()

Hi, I have a longitudinal data frame p that contains 5 variables and looks like this:
> head(p)
date.1 County.x providers beds price
1 Jan/2011 essex 258 5545 251593.4
2 Jan/2011 greater manchester 108 3259 152987.7
3 Jan/2011 kent 301 7191 231985.7
4 Jan/2011 tyne and wear 103 2649 143196.6
5 Jan/2011 west midlands 262 6819 149323.9
6 Jan/2012 essex 2 27 231398.5
The structure of my variables is the following:
'data.frame': 259 obs. of 5 variables:
$ date.1 : Factor w/ 66 levels "Apr/2011","Apr/2012",..: 23 23 23 23 23 24 24 24 25 25 ...
$ County.x : Factor w/ 73 levels "avon","bedfordshire",..: 22 24 32 65 67 22 32 67 22 32 ...
$ providers: int 258 108 301 103 262 2 9 2 1 1 ...
$ beds : int 5545 3259 7191 2649 6819 27 185 24 70 13 ...
$ price : num 251593 152988 231986 143197 149324 ...
I want to order date.1 chronologically. Before applying ordered(), this variable contains no NA observations.
> summary(is.na(p$date.1))
Mode FALSE NA's
logical 259 0
However, once I apply my function for ordering the levels corresponding to date.1:
p$date.1 = with(p, ordered(date.1, levels = c("Jun/2010", "Jul/2010",
"Aug/2010", "Sep/2010", "Oct/2010", "Nov/2010", "Dec/2010", "Jan/2011", "Feb/2011",
"Mar/2011","Apr/2011", "May/2011", "Jun/2011", "Jul/2011", "Aug/2011", "Sep/2011",
"Oct/2011", "Nov/2011", "Dec/2011" ,"Jan/2012", "Feb/2012" ,"Mar/2012" ,"Apr/2012",
"May/2012", "Jun/2012", "Jul/2012", "Aug/2012", "Sep/2012", "Oct/2012", "Nov/2012",
"Dec/2012", "Jan/2013", "Feb/2013", "Mar/2013", "Apr/2013", "May/2013",
"Jun/2013", "Jul/2013", "Aug/2013", "Sep/2013", "Oct/2013", "Nov/2013",
"Dec/2013", "Jan/2014",
"Feb/2014", "Mar/2014", "Apr/2014", "May/2014", "Jun/2014", "Jul/2014" ,"Aug/2014",
"Sep/2014", "Oct/2014", "Nov/2014", "Dec/2014", "Jan/2015", "Feb/2015", "Mar/2015",
"Apr/2015","May/2015", "Jun/2015" ,"Jul/2015" ,"Aug/2015", "Sep/2015", "Oct/2015",
"Nov/2015")))
It seems I lose some observations.
> summary(is.na(p$date.1))
Mode FALSE TRUE NA's
logical 250 9 0
Has anyone come across this problem when using ordered()? Alternatively, is there any other way to group my observations chronologically?
It is possible that some values of p$date.1 don't match any of the levels. Try using this ord.mon as the levels.
ord.mon <- do.call(paste, c(expand.grid(month.abb, 2010:2015), sep = "/"))
Then, you can try this to see if there's any mismatch between the two.
p$date.1 %in% ord.mon
Lastly, you can also sort the data frame after converting the date.1 column to Date (note that you have to add a day of the month beforehand):
p <- p[order(as.Date(paste0("01/", p$date.1), "%d/%b/%Y")), ]
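The mismatch check can be sketched on made-up values as follows. The stray trailing space is only one assumed cause; the vectors x and lv are illustrative and not taken from p.

```r
# Made-up values; note the trailing space in the second one
x <- factor(c("Jan/2011", "Feb/2011 ", "Jan/2012"))
lv <- c("Jan/2011", "Feb/2011", "Jan/2012")

# Values with no matching level silently become NA
o <- ordered(x, levels = lv)
n_missing <- sum(is.na(o))

# Which existing levels are absent from the requested ones?
unmatched <- setdiff(levels(x), lv)

# One possible repair, assuming stray whitespace is the cause
o_fixed <- ordered(trimws(as.character(x)), levels = lv)
anyNA(o_fixed)
```

Running setdiff(levels(p$date.1), ord.mon) on the real data should list exactly the spellings that need fixing.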

How to plot graph with certain requirement of choosing of x-axis in data table in R?

I have a data frame as following:
>str(df)
'data.frame': 22673 obs. of 6 variables:
$ V1 : Factor w/ 39 levels "2015-02-09","2015-02-09 ",..: 1 1 1 1 1 1 1 1 1 1 ...
$ V2 : Factor w/ 10465 levels "00:48:26","01:49:26",..: 3949 3956 3964 3985 4196 4254 4262 4268 4275 4309 ...
$ V3 : Factor w/ 3 levels "Admin","AmbassadorSchoolPlayer",..: 3 3 3 3 3 3 3 3 1 3 ...
$ V4 : Factor w/ 104 levels "1builder1","22mAsgarfus",..: 77 77 57 77 48 48 48 48 6 77 ...
$ V5 : Factor w/ 8580 levels ""," - -?"," - 14 1",..: 2306 874 7433 3650 2306 2306 3364 6501 3257 2306 ...
df$V4 is the user_name, and I'd like to plot a graph with df$V1 on the x-axis and df$V4 on the y-axis. But since the number of users is too big, I'd like to keep only the users (user names) who appear more than a threshold number of times, say 10, in the data frame. How can I do it? I am quite new to R, and I have read several articles introducing ggplot2, but did not find the answer. Thank you in advance.
Use the table function to count the entries per username:
count <- table(df$V4)
Keep the usernames with more than 10 entries:
some_usernames <- names(count[count>10])
Then subset your data frame:
df_subset <- df[df$V4 %in% some_usernames, ]
Then use ggplot2 or base graphics to do what you want. Hope this helps.
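Put together on a made-up data frame, the three steps look like this. The user names and the threshold of 2 are assumptions chosen to keep the example small; the final droplevels call is an optional extra that removes factor levels for users who were filtered out.

```r
# Made-up data: "ann" appears 3 times, "bob" twice, "cat" once
df <- data.frame(V4 = factor(c("ann", "bob", "ann", "cat", "ann", "bob")))
threshold <- 2

# Count entries per username
count <- table(df$V4)

# Usernames appearing more than `threshold` times
some_usernames <- names(count[count > threshold])

# Keep only rows for those usernames
df_subset <- df[df$V4 %in% some_usernames, , drop = FALSE]

# Optional: drop the now-unused factor levels before plotting
df_subset$V4 <- droplevels(df_subset$V4)
levels(df_subset$V4)
```

The drop = FALSE argument only matters here because the toy data frame has a single column; with the six-column df from the question the subset stays a data frame either way.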

can't draw the grouped value above stacked bar plot in ggplot2

I have a ggplot2 question. The code below correctly draws the stacked bar plot, without a value above each bar:
p=ggplot(data=essnn)
p+geom_bar(binwidth=0.5,stat="identity")+ #
aes(x=reorder(classname,-amount,sum), y=amount, label=amount, fill = sort(year))+
theme()
I want to add the sum of amount, grouped by year, for each class, and here is my code:
+geom_text(aes(x=classes,y=total,label=total), data=essnnta, fill=NULL, size=3)
But an error message appears:
Error in fill = year, can not find object "year"
My question is: why can the object "year" be found when I draw the stacked bar plot without the sums, but not once I add the sum of amount grouped by year?
> str(essnn)
'data.frame': 48619 obs. of 15 variables:
$ id : int 2006051337 2006051337 2006051337 2006051337 2006051337 2006051337 2004070648 2006031360 2006031360 2004070062 ...
$ gender : Factor w/ 3 levels "","F","M": 3 3 3 3 3 3 3 3 3 3 ...
$ age : num 30 30 30 30 30 30 38 43 43 37 ...
$ class : Factor w/ 92 levels "100ab","100aa",..: 18 18 18 18 18 18 18 18 18 18 ...
$ classname: Factor w/ 1136 levels "cad"," Office2010",..: 111 111 111 111 111 111 116 107 107 107 ...
$ grade : num 7 5 6 8 3 4 1 4 3 2 ...
$ year : Factor w/ 6 levels "98","99","100",..: 3 3 3 3 2 2 4 5 5 3 ...
$ ses : num 212 210 211 213 207 208 217 221 220 210 ...
$ date : int 1010421 1001115 1010214 1010701 1000411 1000627 1020424 1030304 1021121 1001108 ...
$ money : num 5800 5800 5800 5800 5200 5200 3000 0 5500 5500 ...
$ discount : num 1160 1160 1160 1160 1040 1040 600 0 275 550 ...
$ amount : num 4640 4640 4640 4640 4160 ...
$ idc : Factor w/ 7 levels "在校生","校友",..: 2 2 2 2 2 2 2 7 7 7 ...
$ mdy : Date, format: "2012-04-21" "2011-11-15" "2012-02-14" "2012-07-01" ...
$ day : num 1123 1281 1190 1052 1499 ...
> str(essnnta)
'data.frame': 10 obs. of 2 variables:
$ classes: Factor w/ 10 levels "JD","JF",..: 1 7 8 4 6 10 3 5 2 9
$ total : num 55603526 43708950 43555010 35649129 33214372 ...
Your problem might be that your x-axes are not the same in the two data frames, so ggplot does not know which value corresponds to which stack. I am not sure about this, as I don't understand the way you define your x-axis in the original bar plot. I also find it a bit strange to define the aes outside of the ggplot function or geom_bar, but that might just be me being used to a different kind of syntax.
All in all, I find it difficult to help you because you do not provide a reproducible example.
Here is a small bit of data and a plot that sort of works. Please supplement your question with your data (or a subset of it) and see if this works. You may also want to position the labels at the top of each bar.
library(ggplot2)
essnn <- data.frame(year = c(98,99,100,101,102),
classname = c("a", "b", "c", "d", "e"),
amount = c(1e6, 2e6,3e6,4e6,5e6))
essnnta <- data.frame(total = c(10, 20, 30, 40, 50))
# binwidth is omitted: it is not a geom_bar() parameter in recent ggplot2 versions
ggplot(data=essnn, aes(x=reorder(classname,-amount, sum), y=amount, fill = year)) +
geom_bar(stat="identity", position = "stack") +
geom_text(aes(x=essnn$classname, y=essnnta$total, label=essnnta$total), size=3) # not "classes"

Resources