I have a data frame with SAT scores for all US states.
'data.frame': 51 obs. of 7 variables:
$ X2010.rank : int 1 2 3 4 5 6 7 8 9 10 ...
$ state : chr "Iowa " "Minnesota " "Wisconsin " "Missouri " ...
$ reading : int 603 594 595 593 585 592 585 590 585 580 ...
$ math : int 613 607 604 595 605 603 600 595 593 594 ...
$ writing : int 582 580 579 580 576 571 577 567 568 559 ...
$ combined : int 1798 1781 1778 1768 1766 1766 1762 1752 1746 1733 ...
$ participation: chr "3%" "7%" "4%" "4%" ...
I need to find the row index of a particular state. I tried which(), but it returns integer(0):
> which(sat$state=="California")
integer(0)
However, the same command works on another column and returns the index:
> which(sat$combined==1781)
[1] 2
Where am I going wrong? Please help.
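One thing visible in the str() output above is that the state values carry trailing spaces ("Iowa ", "Minnesota "), so an exact comparison with "California" cannot match. A minimal sketch, assuming that is the cause; the three-row data frame and its combined scores are made-up stand-ins for sat:

```r
# Hypothetical stand-in for `sat`; values are invented, but the trailing
# spaces mirror what str() shows for the real state column.
sat_demo <- data.frame(
  state    = c("Iowa ", "Minnesota ", "California "),
  combined = c(1798L, 1781L, 1517L),
  stringsAsFactors = FALSE
)

# Exact match fails because "California " is not equal to "California".
idx_raw <- which(sat_demo$state == "California")

# trimws() strips leading and trailing whitespace before comparing.
idx_trim <- which(trimws(sat_demo$state) == "California")
```

If this is indeed the problem, cleaning the column once with sat$state <- trimws(sat$state) avoids having to repeat the trim in every comparison.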
I want to analyse three specific years, so I created a vector, score1_3y.
When I plot score1, it displays correctly.
When I plot score1_3y, nothing is displayed and I get:
Error in `check_aesthetics()`:
! Aesthetics must be either length 1 or the same as the data (54): x
Run `rlang::last_error()` to see where the error occurred.
Warning message:
`guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
What is the problem?
Here is the code:
score1_3y <- score1[year == 2020 | year == 2021 | year == 2022]
ggplot(kaoyan, aes(score1_3y, fill = major))+
geom_density(alpha = 0.6)+
facet_wrap(~major)
str(kaoyan)
tibble [54 x 11] (S3: tbl_df/tbl/data.frame)
$ college : chr [1:54] "SUDA" "SUDA" "SUDA" "SUDA" ...
$ applicants: num [1:54] 87 87 87 87 87 87 87 87 87 87 ...
$ admission : num [1:54] 11 11 11 11 11 11 11 11 11 11 ...
$ ratio : num [1:54] 7.91 7.91 7.91 7.91 7.91 ...
$ exemption : num [1:54] 3 3 3 3 3 3 3 3 3 3 ...
$ major : Factor w/ 2 levels "情报学","档案学": 1 1 1 1 1 1 1 2 2 2 ...
$ year : Factor w/ 5 levels "2018","2019",..: 4 4 4 4 4 4 4 4 4 4 ...
$ score1 : num [1:54] 416 410 377 358 358 364 344 403 400 406 ...
$ score2 : num [1:54] 409 408 378 390 387 372 385 401 398 392 ...
$ score3 : num [1:54] 825 818 755 748 745 736 729 804 798 798 ...
$ gjx : num [1:54] 341 341 341 341 341 341 341 341 341 341 ...
The error occurs because aes() is given a subsetted vector (score1_3y) while ggplot() still receives the full dataset (54 rows), so the lengths differ. Instead, filter or subset the data first and map score1:
library(dplyr)
library(ggplot2)
kaoyan %>%
filter(year %in% 2020:2022) %>%
ggplot(aes(score1, fill = major)) +
geom_density(alpha = 0.6) +
facet_wrap(~major)
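The length mismatch can be checked without plotting. The sketch below uses a made-up six-row stand-in for kaoyan (kaoyan_demo and its values are hypothetical) and base R only:

```r
# Hypothetical stand-in for `kaoyan`: 6 rows instead of 54, invented scores.
kaoyan_demo <- data.frame(
  year   = factor(c("2018", "2019", "2020", "2021", "2022", "2022")),
  score1 = c(350, 360, 377, 358, 416, 410)
)

# Logical index of the rows that fall in the three years of interest.
keep <- as.character(kaoyan_demo$year) %in% c("2020", "2021", "2022")

# A free-standing subset vector is shorter than the data frame it came from,
# which is exactly what the aesthetics check complains about.
score1_3y <- kaoyan_demo$score1[keep]
n_vec  <- length(score1_3y)
n_data <- nrow(kaoyan_demo)

# Subsetting the whole data frame keeps every column the same length.
kaoyan_3y <- kaoyan_demo[keep, ]
```

Passing a filtered data frame such as kaoyan_3y to ggplot() keeps score1 and major aligned row by row.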
I have a problem with the summarise function in the dplyr package.
This is the code:
library("dplyr")
a <- read.csv("Number of subway passengers.csv",header = T, stringsAsFactor = F)
a <- a[,c(-2,-3,-4,-5)]
colnames(a)=c("Date","4-5","5-6","6-7","7-8","8-9","9-10","10-11","11-12","12-13","13-14","14-15","15-16",
"16-17","17-18","18-19","19-20","20-21","21-22","22-23","23-24","0-1","1-2","2-3","3-4","Total")
b <- summarise(a,mean_passenger=mean("Total",na.rm=TRUE))
After running the last line, I get the following warning from summarise:
In mean.default("Total", na.rm = TRUE) : argument is not numeric or logical: returning NA
Why does this occur?
I have attached the output of str:
> str(a)
'data.frame': 16501 obs. of 26 variables:
$ Date : chr "2019-11-01" "2019-11-01" "2019-11-01" "2019-11-01" ...
$ 4-5 : int 32 2 3 0 5 0 11 1 2 0 ...
$ 5-6 : int 438 353 89 182 143 211 187 127 83 175 ...
$ 6-7 : int 529 2019 152 852 161 1078 154 477 115 622 ...
$ 7-8 : int 1612 4520 289 2926 288 4395 302 1044 219 1817 ...
$ 8-9 : int 3405 9906 435 9348 482 13000 386 3662 366 5234 ...
$ 9-10 : int 2360 6525 481 4124 631 6669 550 3510 494 3292 ...
$ 10-11 : int 2377 3571 716 2064 768 2964 841 2593 843 2292 ...
$ 11-12 : int 2853 2951 1090 1889 1359 2501 1686 2813 1262 2349 ...
$ 12-13 : int 3334 3190 1073 1538 1531 2127 1781 2646 1583 2160 ...
$ 13-14 : int 3545 3348 1367 1751 1937 2108 2059 2718 1868 2159 ...
$ 14-15 : int 2850 3179 1782 1403 2466 1926 2405 2579 2303 2071 ...
$ 15-16 : int 4606 3265 2235 1431 2821 1718 3125 2103 2479 1559 ...
$ 16-17 : int 4915 3575 2345 1218 3403 1778 3241 2010 2656 1777 ...
$ 17-18 : int 7472 4191 3627 1249 5807 2396 3796 2033 3583 1599 ...
$ 18-19 : int 11107 5445 7462 1486 10738 3746 4836 2582 5246 1776 ...
$ 19-20 : int 5754 3882 2943 816 4680 2557 3192 1682 2709 1261 ...
$ 20-21 : int 3920 2596 2249 439 3670 935 2107 675 1782 548 ...
$ 21-22 : int 3799 2177 2199 288 4495 510 2452 512 1565 341 ...
$ 22-23 : int 3369 1624 1460 296 4118 384 2407 380 1094 260 ...
$ 23-24 : int 1678 912 640 202 2366 299 1394 323 596 153 ...
$ 0-1 : int 228 478 62 47 271 75 236 143 66 73 ...
$ 1-2 : int 2 39 0 1 1 0 6 10 1 1 ...
$ 2-3 : int 0 0 0 0 0 0 0 0 0 0 ...
$ 3-4 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Total : int 70185 67748 32699 33550 52141 51377 37154 34623 30915 31519 ...
With quotes, "Total" is interpreted as a character string rather than a column. We can reproduce the same warning with:
mean("Total")
#[1] NA
Warning message:
In mean.default("Total") : argument is not numeric or logical: returning NA
We need to use Total without quotes so that it is interpreted as a column:
b <- dplyr::summarise(a, mean_passenger = mean(Total,na.rm=TRUE))
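The same distinction between a string and a column can be seen in base R. The sketch below uses a small made-up data frame (a_demo is hypothetical) and also shows two related cases: looking a column up when its name is held in a string, and referring to a non-syntactic name such as 4-5 with backticks.

```r
# Hypothetical stand-in for `a` with one NA in Total; values are invented.
a_demo <- data.frame(Total = c(10L, 20L, NA, 30L))
a_demo[["4-5"]] <- c(1L, 2L, 3L, 4L)   # non-syntactic column name

# A quoted name is just a string: mean() warns and returns NA.
m_string <- suppressWarnings(mean("Total", na.rm = TRUE))

# The column itself is numeric, so mean() works.
m_col <- mean(a_demo$Total, na.rm = TRUE)

# When the name is only available as a string, [[ ]] looks the column up.
col_name  <- "Total"
m_by_name <- mean(a_demo[[col_name]], na.rm = TRUE)

# Names like 4-5 need backticks, otherwise R parses them as subtraction.
m_hour <- mean(a_demo$`4-5`)
```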
I have tried everything I can think of to fix this error, but I have not been able to figure it out. I am on a 32-bit machine, trying to build a choropleth. The data file is pretty basic: some municipal IDs with population figures associated with them. The shapefile is taken from here: www.ontario.ca/data/municipal-boundaries
library('tmap')
library('leaflet')
library('magrittr')
library('rio')
library('plyr')
library('scales')
library('htmlwidgets')
library('tmaptools')
setwd("C:/Users/rdhasa/desktop")
datafile <- "shapefiles2/Population - 2014.csv"
Pop2014 <- rio::import(datafile)
Pop2014$Population <- as.factor(Pop2014$Population)
str(Pop2014)
'data.frame': 454 obs. of 9 variables:
$ MUNID : int 20002 18000 18013 18001 18005 18017 18009 18039 18020 18029 ...
$ YEAR : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ MAH CODE : int 1106 10000 10101 10102 10401 10402 10404 10601 10602 10603 ...
$ V4 : int 1999 1800 1813 1801 1805 1817 1809 1839 1820 1829 ...
$ Municipality: chr "Toronto C" "Durham R" "Oshawa C" "Pickering C" ...
$ Tier : chr "ST" "UT" "LT" "LT" ...
$ A : int 11 11 11 11 11 11 11 11 11 11 ...
$ B : chr "a" "a" "a" "a" ...
$ Population : Factor w/ 438 levels "-","1,006","1,026",..: 160 359 117 432 86 419 97 73 179 171 ...
mnshape <- "shapefiles2/MUNICIPAL_BOUNDARY_LOWER_AND_SINGLE_TIER.shp"
mngeo2 <- read_shape(file=mnshape)
str(mngeo2@data)
'data.frame': 683 obs. of 13 variables:
$ MUNID : int 1002 1002 1002 1009 1009 1009 1016 1016 1016 1026 ...
$ MAH_CODE : int 71616 71616 71616 71618 71618 71618 71614 71614 71614 71613 ...
$ SGC_CODE : int 1005 1005 1005 1011 1011 1011 1020 1020 1020 1030 ...
$ ASSESSMENT: int 101 101 101 406 406 406 506 506 506 511 ...
$ LEGAL_NAME: Factor w/ 414 levels "CITY OF BARRIE",..: 369 369 369 370 370 370 96 96 96 334 ...
$ STATUS : Factor w/ 2 levels "LOWER TIER","SINGLE TIER": 1 1 1 1 1 1 1 1 1 1 ...
$ EXTENT : Factor w/ 3 levels "ISLANDS","LAND",..: 1 2 3 1 2 3 1 2 3 2 ...
$ MSO : Factor w/ 4 levels "CENTRAL","EASTERN",..: 2 2 2 2 2 2 2 2 2 2 ...
$ NAME_PREFI: Factor w/ 8 levels "-","CITY OF",..: 6 6 6 6 6 6 4 4 4 6 ...
$ UPPER_TIER: Factor w/ 30 levels "BRUCE","DUFFERIN",..: 27 27 27 27 27 27 27 27 27 27 ...
$ NAME : Factor w/ 413 levels "ADDINGTON HIGHLANDS",..: 339 339 339 342 342 342 337 337 337 259 ...
$ Shape_Leng: num 0.115 1.622 1.563 0.551 1.499 ...
$ Shape_Area: num 2.32e-05 6.95e-02 7.51e-03 5.63e-04 5.09e-02 ...
mnmap <- append_data(mngeo2, Pop2014, key.shp = "MUNID", key.data="MUNID")
minPct <- min(c(mnmap@data$Population))
maxPct <- max(c(mnmap@data$Population))
paletteLayers <- colorBin(palette = "RdBu", domain = c(minPct, maxPct), bins = c(0, 50000,200000 ,500000, 1000000, 2000000) , pretty=FALSE)
rm(mngeo2)
rm(Pop2014)
rm(mnshape)
rm(datafile)
rm(maxPct)
rm(minPct)
gc()
leaflet(mnmap) %>%
addProviderTiles("CartoDB.Positron") %>%
addPolygons(stroke=TRUE,
smoothFactor = 0.2,
weight = 1,
fillOpacity = .6)
Error: cannot allocate vector of size 177.2 Mb
Is there a way to save memory by simplifying the shapefile? If so, how would I go about doing this efficiently?
Thanks
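Separately from the memory error, the str() output above shows Population stored as a factor with levels such as "1,006" and "-", which would not line up with numeric bins like c(0, 50000, ...). A minimal sketch of parsing such strings into numbers, assuming commas are thousands separators and "-" marks a missing value; the sample values are made up:

```r
# Hypothetical sample of raw Population strings in the format shown by str().
pop_raw <- c("1,006", "-", "2,615,060")

# Remove thousands separators, then convert; "-" cannot be parsed and
# becomes NA (the coercion warning is suppressed on purpose).
pop_num <- suppressWarnings(as.numeric(gsub(",", "", pop_raw, fixed = TRUE)))

# Range for a colour scale, ignoring the missing entries.
pop_range <- range(pop_num, na.rm = TRUE)
```

Applied to the real data, this would replace the as.factor() step, so that min() and max() operate on actual population figures.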
I have a list of 170 items, each with 12 variables. This data is currently organised in one continuous row (1 observation of 2040 variables), e.g.:
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2
but I want it to be organised into 170 columns with 12 rows as follows:
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
0 1 2
I have tried the following:
list2=lapply(list1, function(x) t(x))
but this doesn't alter the data in any way. Is there something else I can do to transform the data?
We convert the string to a numeric vector with scan, split the vector by itself to create a list, and convert it to a data.frame:
v2 <- scan(text=v1, what=numeric(), quiet=TRUE)
data.frame(split(v2, v2))
If your data is already converted into a vector (as @akrun showed using scan), you could also do:
data <- 1:2040 # your data
breaks <- seq(1, 2040, 170)
result <- lapply(breaks, function(x) data[x : (x + 169)])
Results in
> str(result)
List of 12
$ : int [1:170] 1 2 3 4 5 6 7 8 9 10 ...
$ : int [1:170] 171 172 173 174 175 176 177 178 179 180 ...
$ : int [1:170] 341 342 343 344 345 346 347 348 349 350 ...
$ : int [1:170] 511 512 513 514 515 516 517 518 519 520 ...
$ : int [1:170] 681 682 683 684 685 686 687 688 689 690 ...
$ : int [1:170] 851 852 853 854 855 856 857 858 859 860 ...
$ : int [1:170] 1021 1022 1023 1024 1025 1026 1027 1028 1029 1030 ...
$ : int [1:170] 1191 1192 1193 1194 1195 1196 1197 1198 1199 1200 ...
$ : int [1:170] 1361 1362 1363 1364 1365 1366 1367 1368 1369 1370 ...
$ : int [1:170] 1531 1532 1533 1534 1535 1536 1537 1538 1539 1540 ...
$ : int [1:170] 1701 1702 1703 1704 1705 1706 1707 1708 1709 1710 ...
$ : int [1:170] 1871 1872 1873 1874 1875 1876 1877 1878 1879 1880 ...
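Note that the chunks above hold 170 values each, whereas the layout asked for in the question is 12 rows by 170 columns, with each item's 12 consecutive values forming one column. Since matrix() fills column by column, a sketch of that layout could look as follows; v is a made-up vector that mimics the question's pattern (twelve 0s, twelve 1s, and so on):

```r
# Hypothetical input: 170 items, each repeated 12 times in a row.
v <- rep(0:169, each = 12)

# matrix() fills column-wise, so every run of 12 values becomes one column.
m <- matrix(v, nrow = 12)

# Optional: convert to a data frame with 170 columns (V1 ... V170).
df <- as.data.frame(m)
```

This assumes the values of each item are stored contiguously; if the data were interleaved instead, matrix(v, nrow = 12, byrow = TRUE) would be the corresponding call.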
I'm running a straightforward linear regression model fit on the following dataframe:
> str(model_data_rev)
'data.frame': 128857 obs. of 12 variables:
$ ENTRY_4 : num 186 218 208 235 256 447 471 191 207 250 ...
$ ENTRY_8 : num 724 769 791 777 707 237 236 726 773 773 ...
$ ENTRY_12: num 2853 2989 3174 3027 3028 ...
$ ENTRY_16: num 2858 3028 3075 2992 3419 ...
$ ENTRY_20: num 7260 7188 7587 7560 7165 ...
$ EXIT_4 : num 70 82 105 114 118 204 202 99 73 95 ...
$ EXIT_8 : num 1501 1631 1594 1576 1536 ...
$ EXIT_12 : num 3862 3923 4158 3970 3895 ...
$ EXIT_16 : num 1559 1539 1737 1681 1795 ...
$ EXIT_20 : num 2145 2310 2217 2330 2291 ...
$ DAY : Ord.factor w/ 7 levels "Sun"<"Mon"<"Tues"<..: 2 3 4 5 6 7 1 2 3 4 ...
$ MONTH : Ord.factor w/ 12 levels "Jan"<"Feb"<"Mar"<..: 3 3 3 3 3 3 3 3 3 3 ...
I split the data into training and test sets using the caret package as follows:
split<-createDataPartition(y = model_data_rev$EXIT_20, p = 0.7, list = FALSE)
d_training = model_data_rev[split,]
d_test = model_data_rev[-split,]
I train the model using the train function in the caret package:
ctrl<-trainControl(method = 'cv',number = 5)
lmCVFit<-train(EXIT_20 ~ ., data = d_training, method = 'lm', trControl = ctrl, metric='Rsquared')
summary(lmCVFit)
When I run summary(lmCVFit) I get the following error:
Error in summary.lm(object$finalModel, ...) :
length of 'dimnames' [1] not equal to array extent
In addition: Warning message:
In cbind(est, se, tval, 2 * pt(abs(tval), rdf, lower.tail = FALSE)) :
number of rows of result is not a multiple of vector length (arg 1)
I thought it might be related to my initial data frame above. Specifically, I thought it could have to do with the factor variables. So I removed them (not shown), ran everything again, and got the same error.
I also ran the regression without CV using the lm function in R and got the same error when I ran summary().
Has anyone seen this, and can anyone help? I can't find anything online that speaks to this error in the context of regression.
Thanks in advance.
EDIT
I converted the ordered factor variables to standard (unordered) factors. The structure now looks like this:
> str(model_data_rev)
'data.frame': 128857 obs. of 12 variables:
$ ENTRY_4 : num 186 218 208 235 256 447 471 191 207 250 ...
$ ENTRY_8 : num 724 769 791 777 707 237 236 726 773 773 ...
$ ENTRY_12: num 2853 2989 3174 3027 3028 ...
$ ENTRY_16: num 2858 3028 3075 2992 3419 ...
$ ENTRY_20: num 7260 7188 7587 7560 7165 ...
$ EXIT_4 : num 70 82 105 114 118 204 202 99 73 95 ...
$ EXIT_8 : num 1501 1631 1594 1576 1536 ...
$ EXIT_12 : num 3862 3923 4158 3970 3895 ...
$ EXIT_16 : num 1559 1539 1737 1681 1795 ...
$ EXIT_20 : num 2145 2310 2217 2330 2291 ...
$ DAY : Factor w/ 7 levels "Friday","Monday",..: 2 6 7 5 1 3 4 2 6 7 ...
$ MONTH : Factor w/ 12 levels "April","August",..: 8 8 8 8 8 8 8 8 8 8 ...
I still get the error when running summary after fitting the model.
It is also important to emphasize that the model fitting itself works without throwing an error. It is summary() that throws the error.
Thanks.