In the house price prediction dataset, there are 81 variables and 1459 observations.
To understand the data better, I have segregated the variables which are 'char' type.
char_variables = sapply(property_train, is.character)
char_names = names(property_train[,char_variables])
char_names
There are 42 variables of character type.
I want to count how many observations fall into each category of each of these variables.
For a single variable, the code would be:
table(property_train$Zoning_Class)
Commer FVR RHD RLD RMD
10 65 16 1150 218
But repeating this for 42 variables would be tedious.
The for loops I've tried to print all the tables give an error.
for (val in char_names){
print(table(property_train[[val]]))
}
Abnorml AdjLand Alloca Family Normal Partial
101 4 12 20 1197 125
Is there a way to iterate over char_names and print all 42 tables?
str(property_train)
'data.frame': 1459 obs. of 81 variables:
$ Id : int 1 2 3 4 5 6 7 8 9 10 ...
$ Building_Class : int 60 20 60 70 60 50 20 60 50 190 ...
$ Zoning_Class : chr "RLD" "RLD" "RLD" "RLD" ...
$ Lot_Extent : int 65 80 68 60 84 85 75 NA 51 50 ...
$ Lot_Size : int 8450 9600 11250 9550 14260 14115 10084 10382..
$ Road_Type : chr "Paved" "Paved" "Paved" "Paved" ...
$ Lane_Type : chr NA NA NA NA ...
$ Property_Shape : chr "Reg" "Reg" "IR1" "IR1" ...
$ Land_Outline : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
Actually, for me your code does not give an error (make sure to evaluate all lines in the for-loop together):
property_train <- data.frame(a = 1:10,
b = rep(c("A","B"),5),
c = LETTERS[1:10])
char_variables = sapply(property_train, is.character)
char_names = names(property_train[,char_variables])
char_names
table(property_train$b)
for (val in char_names){
print(table(property_train[val]))
}
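The loop above can also be written without an explicit for loop. The following is only a sketch on a toy data frame (the values are made up, and freq_tables is a name chosen for illustration); it uses lapply to build one table per character column and returns them as a named list:

```r
# Toy data frame standing in for property_train (hypothetical values).
property_train <- data.frame(
  a = 1:6,
  b = c("A", "B", "A", "B", "A", "A"),
  c = c("x", "y", "y", "y", "x", "y"),
  stringsAsFactors = FALSE
)

# Pick the character columns by name. Indexing names() directly avoids the
# case where a single selected column is dropped to a plain vector.
char_names <- names(property_train)[sapply(property_train, is.character)]

# One frequency table per character column, returned as a named list.
# useNA = "ifany" also counts missing values when a column has any.
freq_tables <- lapply(property_train[char_names], table, useNA = "ifany")

freq_tables
```

Keeping the tables in a list means a single one can be looked up later, for example freq_tables$b.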
You can also get this result in a more user-friendly form using dplyr and tidyr, by pivoting all the character columns into a long format and counting every column-value combination:
library(dplyr)
library(tidyr)
property_train %>%
select(where(is.character)) %>%
pivot_longer(cols = everything(), names_to = "column") %>%
group_by(column, value) %>%
summarise(freq = n())
I am trying to tidy a dataset where I measured exposure at two different stations by time (in seconds). I have a data frame which has column 1 as Second, then SiteA_Number (the number of particles at SiteA), SiteA_Diam (diameter of particles at SiteA), SiteA_LDSA (LDSA at SiteA), and the same measurements for SiteB as 3 more columns (SiteB_Number, SiteB_Diam, SiteB_LDSA).
I would like my dataset to transform to have a column for the Seconds, columns for Number, Diameter, and LDSA, and a separate column for the station (SiteA or SiteB). That way, I can plot a graph with Number (y axis) over time (seconds) and fill by site.
The structure of each column is as follows:
'data.frame': 1800 obs. of 7 variables:
$ Second: num 1 2 3 4 5 6 7 8 9 10 ...
$ SiteA_Number : int 16673 19891 20370 17513 18185 18982 18362 17579 16605 15590 ...
$ SiteA_Diam : int 41 39 38 42 41 39 40 42 44 45 ...
$ SiteA_LDSA : num 36.1 40.4 40.7 38.6 38.8 ...
$ SiteB_Number: int 15554 16745 17719 16494 15811 15331 16053 16196 15733 15521 ...
$ SiteB_Diam : int 40 39 37 40 42 44 42 42 42 43 ...
$ SiteB_LDSA : num 33 33.8 34.3 34.2 35.2 ...
I tried using pivot_longer to create a station column and then corresponding columns for the number, diameter, and LDSA:
MergedLDSA %>%
pivot_longer(-Second,
names_to =c("Station", ".value"),
names_sep = ("_"),
names_transform = list(
Number = as.integer,
Diameter = as.integer,
LDSA = as.integer,
Station = as.character())
)
But I get the error message:
Error in `map()`:
! Can't convert
`.x[[i]]`, an empty
character vector, to
a function.
I then tried using the separate() function:
MergedLDSA %>%
separate(c(SiteA_Number, SiteA_Diam, SiteA_LDSA, SiteB_Number, SiteB_Diam, SiteB_LDSA), into = c("Station", ".value"), sep = "_")
But I get the error message:
Error:
! Must extract column with a single valid subscript.
x Subscript `var` has size 6 but must be size 1.
I'm fairly beginner at coding and this is my first time trying to tidy real data. I do not understand the errors and cannot figure out how to tidy my data the way I'd like.
Any help would be greatly appreciated! :)
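One possible way to get the shape described above is sketched here, assuming every measurement column follows the Site_Measure naming pattern. The first error most likely comes from Station = as.character(), which calls the function with no arguments and so passes an empty character vector instead of the function itself. names_transform also only applies to columns built from the names (here Station), so it can simply be left out. separate() splits the values of one column, not the column names, which would explain the second error. The toy data frame below reuses the first three values from the str() output:

```r
library(tidyr)

# Small stand-in for MergedLDSA with the same column naming.
MergedLDSA <- data.frame(
  Second       = c(1, 2, 3),
  SiteA_Number = c(16673L, 19891L, 20370L),
  SiteA_Diam   = c(41L, 39L, 38L),
  SiteA_LDSA   = c(36.1, 40.4, 40.7),
  SiteB_Number = c(15554L, 16745L, 17719L),
  SiteB_Diam   = c(40L, 39L, 37L),
  SiteB_LDSA   = c(33.0, 33.8, 34.3)
)

# The part before "_" becomes the Station column; the part after "_"
# (".value") becomes the name of the output column holding the values.
long <- pivot_longer(
  MergedLDSA,
  cols      = -Second,
  names_to  = c("Station", ".value"),
  names_sep = "_"
)

long
```

The diameter column comes out as Diam because that is the suffix in the original column names.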
library(dslabs)
data(heights)
library(dplyr)
mutate(heights, ht_cm = height * 2.54, stringsAsFactor = FALSE )
str(heights) # not showing ht_cm as a variable in the data frame
mean(heights$ht_cm) # giving error that argument is not numeric
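You called mutate() without saving its result, so heights itself is unchanged. (The stringsAsFactor = FALSE argument is not an option of mutate(); it just adds a column with that name.) To keep the new column, assign the result back to heights: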
Code
heights <-
heights %>%
mutate(ht_cm = height * 2.54)
Output
str(heights)
'data.frame': 1050 obs. of 3 variables:
$ sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 1 1 1 1 2 ...
$ height: num 75 70 68 74 61 65 66 62 66 67 ...
$ ht_cm : num 190 178 173 188 155 ...
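The difference between calling mutate() and assigning its result can be seen on a tiny example. This is a sketch with a made-up data frame (heights_demo is a hypothetical stand-in for heights, so it runs without dslabs):

```r
library(dplyr)

# Hypothetical stand-in for the heights data.
heights_demo <- data.frame(height = c(75, 70, 68))

# Without assignment the result is printed and then discarded.
mutate(heights_demo, ht_cm = height * 2.54)
"ht_cm" %in% names(heights_demo)   # FALSE: the original is untouched

# With assignment the new column is kept.
heights_demo <- mutate(heights_demo, ht_cm = height * 2.54)
mean(heights_demo$ht_cm)
```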
I'd like to make a new column in which the value depends on other columns.
There are three possible outcomes
Distance < Min_disp = 0
Distance < Max_disp = Distance
Distance > Max_disp = Max_disp
I have tried using an if-statement, with multiple outcomes, but receive a warning.
Warning messages:
1: In if (Noord_2015_moved$Distance < Noord_2015_moved$Min_disp) { :
the condition has length > 1 and only the first element will be used
2: In if (Noord_2015_moved$Distance < Noord_2015_moved$Max_disp) { :
the condition has length > 1 and only the first element will be used
And indeed it only prints "Max_disp".
This is the code I've used
if (Noord_2015_moved$Distance < Noord_2015_moved$Min_disp) {
0
} else if (Noord_2015_moved$Distance < Noord_2015_moved$Max_disp) {
Noord_2015_moved$Distance
} else {
Noord_2015_moved$Max_disp
}
I have also tried running it in three separate steps, but then I run into the problem that I don't know how to tell R to only apply part of the df$column, because now I get the error
number of items to replace is not a multiple of replacement length
Noord_2015_moved <- mutate(Noord_2015_moved, Actual_disp = ifelse(Distance < Min_disp, 0, NA))
Noord_2015_moved$Actual_disp[Noord_2015_moved$Distance < Noord_2015_moved$Max_disp] <- Noord_2015_moved$Distance
Noord_2015_moved$Actual_disp[is.na(Noord_2015$Actual_disp)] <- Noord_2015_moved$Max_disp
And this is my data
'data.frame': 301 obs. of 15 variables:
$ Transmitter: Factor w/ 18 levels "A69-1601-22313",..: 1 1 1 1 1 1 1 2 2 2 ...
$ Date : Date, format: "2015-03-03" "2015-03-08" "2015-03-11" "2015-05-18" ...
$ Date_time : Factor w/ 279544 levels "1-03-15 0:00",..: 198302 258702 18684 85140 190788 182641 208718 26315 198759 205744 ...
$ Receiver : Factor w/ 17 levels "uitzetpunt 1-noord",..: 8 5 8 5 6 7 6 8 5 8 ...
$ Station : Factor w/ 17 levels "10","11","12",..: 15 12 15 12 13 14 13 15 12 15 ...
$ Traject : Factor w/ 53 levels "","10-10","10-9",..: 53 50 41 50 40 44 45 53 50 41 ...
$ Interval : num 83.4 12.7 42.6 25.2 217.4 ...
$ Distance : num 1540 6480 6480 6480 4690 4220 4220 1540 6480 6480 ...
$ Min_speed : num 0.02 0.51 0.15 0.26 0.02 0.73 0.52 0.01 0.02 0.02 ...
$ Min_speed2 : num 0.00556 0.14167 0.04167 0.07222 0.00556 ...
$ Length : int 47 47 47 47 47 47 47 45 45 45 ...
$ Activity : chr "Low" "Low" "Low" "Low" ...
$ Moved : chr "Yes" "Yes" "Yes" "Yes" ...
$ Min_disp : num 160 4080 1200 2080 160 5840 4160 80 160 160 ...
$ Max_disp : num 240 6120 1800 3120 240 8760 6240 120 240 240 ...
if() isn't vectorized. It works on a single condition, not a whole vector. That's what the warning "the condition has length > 1 and only the first element will be used" is telling you. (Since R 4.2.0 this is an error rather than a warning.) You could use if() for this purpose, but you would need to put it in a for loop to check each row one at a time. Doable, but not efficient.
ifelse is a vectorized version of if, and is good for a problem like this. Here you would probably nest two ifelse calls:
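For comparison, the row-by-row if() approach mentioned above might look like the following sketch. The data frame df and its values are made up for illustration; only the column names follow the question:

```r
# Toy data with the three relevant columns (hypothetical values).
df <- data.frame(
  Distance = c(100, 1500, 9000),
  Min_disp = c(160, 1200, 5840),
  Max_disp = c(240, 1800, 8760)
)

# Pre-allocate the result, then test one row at a time so that each
# if() condition has length 1.
df$Actual_disp <- NA_real_
for (i in seq_len(nrow(df))) {
  if (df$Distance[i] < df$Min_disp[i]) {
    df$Actual_disp[i] <- 0
  } else if (df$Distance[i] < df$Max_disp[i]) {
    df$Actual_disp[i] <- df$Distance[i]
  } else {
    df$Actual_disp[i] <- df$Max_disp[i]
  }
}

df$Actual_disp
```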
Noord_2015_moved$Actual_disp = ifelse(
Noord_2015_moved$Distance < Noord_2015_moved$Min_disp, 0,
ifelse(Noord_2015_moved$Distance < Noord_2015_moved$Max_disp, Noord_2015_moved$Distance,
Noord_2015_moved$Max_disp
))
I see you have a single mutate. If you're using dplyr, you can use mutate which adds a column to the data frame and means you don't need to type out the data frame's name to reference existing columns. This code is equivalent to my above code:
Noord_2015_moved = Noord_2015_moved %>% mutate(
Actual_disp = ifelse(Distance < Min_disp, 0,
ifelse(Distance < Max_disp, Distance, Max_disp)
)
)
Instead of nesting ifelse calls, you can use dplyr::case_when, which handles multiple outcomes more readably:
Noord_2015_moved = Noord_2015_moved %>% mutate(
Actual_disp = case_when(
Distance < Min_disp ~ 0,
Distance < Max_disp ~ Distance,
Distance >= Max_disp ~ Max_disp,
TRUE ~ NA_real_
)
)
Here is a short reference.
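Because the last two outcomes amount to capping Distance at Max_disp, the inner ifelse can also be replaced by pmin(). This is a sketch on made-up data, assuming there are no missing values; nested and capped are names chosen for illustration:

```r
# Toy data (hypothetical values); the last row has Distance equal to Max_disp.
df <- data.frame(
  Distance = c(100, 1500, 9000, 240),
  Min_disp = c(160, 1200, 5840, 160),
  Max_disp = c(240, 1800, 8760, 240)
)

# Nested ifelse, as in the answer above.
nested <- ifelse(df$Distance < df$Min_disp, 0,
                 ifelse(df$Distance < df$Max_disp, df$Distance, df$Max_disp))

# Same rule with the upper limit applied by pmin(), the element-wise minimum.
capped <- ifelse(df$Distance < df$Min_disp, 0, pmin(df$Distance, df$Max_disp))

identical(nested, capped)
```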
My dataset looks something like the sample below. In each pair, the first number is the feature number, followed by a colon and then the value associated with that feature. I am not sure how to import this dataset into R. Does anyone have any ideas?
236:24 500:163 732:234 869:117 885:106 1249:103 1280:158 1889:119 2015:55 2718:126 3307:137 3578:25 3770:26 4139:128 4723:114 4957:82 5128:50 5420:124 5603:135 5897:34 5946:117 6069:154 6153:55 6347:87 6372:77 6666:109 6866:223 6984:39 7709:253 7950:87 8078:38 8945:141 9316:111 9948:103 9989:68 10276:43 10530:76 10532:55 10799:15 10802:20 10848:82 11347:16 11871:51 11883:105 12534:133 12601:13 12781:178 12798:116 12842:106 12916:7 12935:51 12968:154 13028:58 13330:105 13384:2 13568:47 13641:632 13829:18 13964:62 14385:93 14392:272 15280:140 15424:119 15492:52 15523:31 16311:23 16464:69 16478:94 16584:102 16586:107 16705:272 17138:108 17181:150 17526:280 17540:163 18007:114 18050:53 18180:2 18806:160 18943:73 19055:41 19255:88 19774:59 19889:72 19921:45
101:68 572:57 732:63 962:120 1304:61 1831:60 1889:58 1973:105 2518:161 2629:228 2990:158 3147:75 3578:11 3860:88 4011:18 4623:141 4684:411 4758:69 4820:120 6149:102 6234:134 6306:118 6866:147 6927:89 6988:51 7048:178 7193:31 7257:61 7709:229 8061:125 8202:188 8272:17 8759:165 9104:77 9325:135 9860:97 10055:684 10532:180 10735:64 10744:267 10820:120 10848:186 10923:128 10936:129 11203:160 11303:144 11668:87 11867:97 11871:207 12191:83 12238:193 12380:51 12968:164 13369:58 13929:39 14531:102 14800:130 14931:99 15314:91 15632:62 16165:7 16353:120 16584:137 17216:172 18372:31 18893:75 19133:93 19154:101 19165:133 19607:20 19784:141 19889:97 19921:60
Assuming your data is stored in input.txt,
input <- scan('input.txt', what = 'character')
# byrow = TRUE keeps each feature:value pair together on one row
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 ...
Alternatively, you can use read.table to parse the input rather than manually splitting the strings; this is slightly slower but more readable.
data <- read.table(text = input, sep = ':')
colnames(data) <- c('Feature', 'Value')
str(data)
# 'data.frame': 158 obs. of 2 variables:
# $ Feature: int 236 500 732 869 885 1249 1280 1889 2015 2718 ...
# $ Value : int 24 163 234 117 106 103 158 119 55 126 ...
Edit: adapted for your dataset. Reads your Feature/Value pairs into a data frame.
url <- 'https://archive.ics.uci.edu/ml/machine-learning-databases/dexter/DEXTER/dexter_test.data'
input <- scan(url, what = 'character')
# byrow = TRUE keeps each feature:value pair together on one row
data <- as.data.frame(matrix(as.numeric(unlist(strsplit(input, ':'))), ncol = 2, byrow = TRUE))
colnames(data) <- c('Feature','Value')
str(data)
# 'data.frame': 192449 obs. of 2 variables:
# $ Feature: num 236 500 732 869 885 ...
# $ Value : num 24 163 234 117 106 ...
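The data frames above pool all feature:value pairs and no longer record which line (sample) each pair came from. If that matters, one option is to read the file line by line and keep a sample index. This is a sketch in base R on two short made-up lines; for a real file, lines could come from readLines('input.txt'), and the names sparse and Sample are chosen for illustration:

```r
# Two short lines in the same "feature:value" layout (made-up sample).
lines <- c("236:24 500:163 732:234", "101:68 572:57")

# Split each line into its "feature:value" tokens (one list element per line).
tokens <- strsplit(lines, " ", fixed = TRUE)

# Split every token at the colon and stack the pieces into a two-column matrix.
pairs <- do.call(rbind, strsplit(unlist(tokens), ":", fixed = TRUE))

# Long ("sparse") layout: one row per non-zero entry, tagged with its sample.
sparse <- data.frame(
  Sample  = rep(seq_along(tokens), lengths(tokens)),
  Feature = as.integer(pairs[, 1]),
  Value   = as.integer(pairs[, 2])
)

sparse
```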
I'm trying to create an object with two columns containing the following strings, keeping only the rows where the conditions on BOTH columns are true. I tried, unsuccessfully, with:
CrimesAndLocation <- table(c(Crimes_Data$Primary.Type=="ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY",Crimes_Data$Location.Description=="RESIDENCE")))
I'm trying to get an output where:
Primary.Type is one of the 8 specific felonies listed above. Thus, it should not show all 32 possible felonies, just the 8 listed above
Location.Description, is RESIDENCE
This is the goal of what I'm trying to do:
COLUMN 1 COLUMN 2
"ARSON" "RESIDENCE"
"KIDNAPPING" "RESIDENCE"
"BATTERY" "RESIDENCE"
"HOMICIDE" "RESIDENCE"
"ASSAULT" "RESIDENCE"
...
UPDATE: > str(Crimes_Data) :
'data.frame': 293036 obs. of 22 variables:
$ ID : int 10248194 10251162 10248198 10248242 10248228 10248223 10248192 10248157 10249529 10252453 ...
$ Case.Number : Factor w/ 293015 levels "F218264","HA168845",..: 292354 292350 292363 292359 292368 292366 292351 292348 292364 292816 ...
$ Date : Factor w/ 124573 levels "01/01/2015 01:00:00 AM",..: 94544 94542 94539 94536 94535 94535 94535 94535 94529 94528 ...
$ Block : Factor w/ 27983 levels "0000X E 100TH PL",..: 13541 7650 22635 1317 13262 9623 12854 8232 24201 14279 ...
$ IUCR : Factor w/ 334 levels "0110","0130",..: 49 139 321 33 251 82 38 282 97 38 ...
$ Primary.Type : Factor w/ 32 levels "ARSON","ASSAULT",..: 3 7 24 3 18 31 3 13 17 3 ...
$ Description : Factor w/ 313 levels "$500 AND UNDER",..: 111 281 119 35 131 1 260 193 274 260 ...
$ Location.Description: Factor w/ 121 levels "","ABANDONED BUILDING",..: 95 19 110 48 97 110 106 110 110 99 ...
$ Arrest : Factor w/ 2 levels "false","true": 1 1 2 1 2 2 1 2 2 1 ...
$ Domestic : Factor w/ 2 levels "false","true": 2 1 1 1 1 1 1 1 1 1 ...
$ Beat : int 835 333 733 634 1121 1432 1024 735 414 2535 ...
$ District : int 8 3 7 6 11 14 10 7 4 25 ...
$ Ward : int 18 5 6 21 27 1 22 17 7 26 ...
$ Community.Area : int 70 43 68 49 23 22 30 67 46 23 ...
$ FBI.Code : Factor w/ 26 levels "01A","01B","02",..: 11 17 26 6 21 8 11 25 9 11 ...
$ X.Coordinate : int 1154209 1190610 1172166 1176493 1153156 1159961 1154332 1163770 1193570 NA ...
$ Y.Coordinate : int 1852321 1856955 1858813 1841948 1904451 1915955 1887190 1857568 1852889 NA ...
$ Year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
$ Updated.On : Factor w/ 442 levels "01/01/2015 12:39:07 PM",..: 288 288 288 288 288 288 288 288 288 288 ...
$ Latitude : num 41.8 41.8 41.8 41.7 41.9 ...
$ Longitude : num -87.7 -87.6 -87.6 -87.6 -87.7 ...
$ Location : Factor w/ 173646 levels "","(41.644604096, -87.610728247)",..: 31318 40835 45858 15601 116871 140063 84837 42961 32176 1 ...
This is a good job for the dplyr package. The filter function will filter a data frame according to any number of logical expressions that you feed it. The following should work for you:
library(dplyr)
filter(
Crimes_Data,
Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE", "HUMAN TRAFFICKING",
"KIDNAPPING", "ROBBERY"),
Location.Description == "RESIDENCE"
)
If you'd rather not use dplyr, you can do it the old fashioned way with base R, like this:
type.bool <- Crimes_Data$Primary.Type %in% c("ARSON", "ASSAULT", "BATTERY",
"BURGLARY", "HOMICIDE",
"HUMAN TRAFFICKING", "KIDNAPPING",
"ROBBERY")
location.bool <- Crimes_Data$Location.Description == "RESIDENCE"
Crimes_Data[type.bool & location.bool, ]
Instead of an integer vector of indices, the [ subsetting operator can take a boolean vector. In that case, it will only return the rows of the data frame for which the corresponding elements of the boolean vector are TRUE.
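To see what the logical vector looks like, here is a small sketch on a made-up four-row data frame (crimes and keep are hypothetical names; only the column names follow the question):

```r
# Tiny stand-in for Crimes_Data (hypothetical rows).
crimes <- data.frame(
  Primary.Type         = c("ARSON", "THEFT", "BATTERY", "ROBBERY"),
  Location.Description = c("RESIDENCE", "RESIDENCE", "STREET", "RESIDENCE"),
  stringsAsFactors     = TRUE
)

# One TRUE/FALSE per row: TRUE only where both conditions hold.
keep <- crimes$Primary.Type %in% c("ARSON", "BATTERY", "ROBBERY") &
  crimes$Location.Description == "RESIDENCE"

keep

# Rows where keep is TRUE are returned; the others are dropped.
crimes[keep, ]
```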
Thanks for the str() (aka "structure") output update; it makes it easier to help you.
To obtain a list of observations where
these eight felonies : "ARSON","ASSAULT","BATTERY","BURGLARY","HOMICIDE","HUMAN TRAFFICKING","KIDNAPPING","ROBBERY"
occurred at RESIDENCE
Try breaking up the task into slightly smaller parts:
Step 1:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" | Primary.Type == "ASSAULT" | Primary.Type == "BATTERY" | Primary.Type == "BURGLARY" | Primary.Type == "HOMICIDE" | Primary.Type == "HUMAN TRAFFICKING" | Primary.Type == "KIDNAPPING" | Primary.Type == "ROBBERY")
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
Result:
ViolentCrimesResidence holds two columns: column 1 is Primary.Type and column 2 is Location.Description. Column 1 only has values from the eight felonies of interest and column 2 only has "RESIDENCE".
Explanation
Step 1:
From R website's examples about subset and OR condition:
PineTreeGrade3Data<-subset(StudentData, SchoolName=="Pine Tree Elementary" | Grade==3)
Whereas we have:
ViolentCrimes = subset(Crimes_Data, Primary.Type == "ARSON" |
we use the subset() function
Crimes_Data is the existing data frame as input
next are the conditions, which simply take the pattern VectorName == "Some string", in this case Primary.Type == "ARSON"
But we want observations for the other types too, so use the "or" condition to include them
in R, "or" is written with | symbol. So we use this repeatedly to include each of the other felonies of interest
the equal sign = works like <- here: it assigns (saves) the subset result to a new data frame we call ViolentCrimes
note I prefer using = because it takes fewer keystrokes than <-; either is correct for assignment at the top level
Step 2:
ViolentCrimesResidence = subset(ViolentCrimes, Location.Description == "RESIDENCE", select = c(Primary.Type, Location.Description))
the input is the ViolentCrimes data frame we made previously, which contains only the eight felonies ("ARSON", "ASSAULT", ...)
now we are interested in, out of all these violent crimes, which ones occurred at home, so use condition Location.Description == "RESIDENCE"
but a further option of subset() we didn't use before, is the select = ... option
we do a select = c(Variable1, Variable2) to choose just the Primary.Type and Location.Description vectors
note that if you don't want to limit the columns (aka variables), you simply omit the select = ... option
thus it saves this new subset into ViolentCrimesResidence
So, now in R when you:
ViolentCrimesResidence
You will see a two-column output you wanted of the eight felonies of interest, that happened in RESIDENCE.
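The two subset() steps can also be combined, with %in% standing in for the chain of | conditions. Because Primary.Type is a factor, the result still carries all of the original levels, so a table() on it would list every crime type with zero counts; droplevels() removes the unused ones. The following is only a sketch on made-up rows (crimes, felonies and res are hypothetical names):

```r
# Tiny stand-in for Crimes_Data with factor columns (hypothetical rows).
crimes <- data.frame(
  Primary.Type         = c("ARSON", "THEFT", "BATTERY", "ROBBERY", "ASSAULT"),
  Location.Description = c("RESIDENCE", "RESIDENCE", "STREET", "RESIDENCE", "STREET"),
  stringsAsFactors     = TRUE
)

felonies <- c("ARSON", "ASSAULT", "BATTERY", "BURGLARY", "HOMICIDE",
              "HUMAN TRAFFICKING", "KIDNAPPING", "ROBBERY")

# Both conditions in one subset() call, keeping only the two columns.
res <- subset(
  crimes,
  Primary.Type %in% felonies & Location.Description == "RESIDENCE",
  select = c(Primary.Type, Location.Description)
)

# Drop factor levels that no longer occur in the filtered data.
res <- droplevels(res)

table(res$Primary.Type)
```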