This is the code I was using for my data mining assignment in RStudio to preprocess the data.
setwd('C:/Users/user/OneDrive/assignments/Data mining/individual')
dataset = read.csv('Dataset.csv')
dataset[dataset == '?'] <- NA
View(dataset)
x <- na.omit(dataset)
library(tidyr)
library(dplyr)
library(outliers)
View(gather(x))
x$Age[x$Age <= 30] <- 3
x$Age[(x$Age <=49) & (x$Age >= 31)] <- 2
x$Age[(x$Age != 3) & (x$Age !=2)] <- 1
x$Hours_Per_week[x$Hours_Per_week <= 30] <- 3
x$Hours_Per_week[(x$Hours_Per_week <= 49)& (x$Hours_Per_week >= 31)] <- 2
x$Hours_Per_week[(x$Hours_Per_week != 3) & (x$Hours_Per_week != 2)] <- 1
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov"), labels = c(1,2,3,4,5,6) )
Here is the result of the code:
str(x)
As you can see in the result, after the last line of code, all the data in the column Hours_Per_week suddenly changes to NA. I don't know why this happens, since every other example I saw online replaced the values with the labels.
The link for the dataset :
dataset
Unfortunately I do not know the original data. Possibly you just have to swap the contents of levels and labels:
x$Work_Class <- factor(x$Work_Class, levels = c(1,2,3,4,5,6), labels = c("Federal-gov","Local-gov","Private","Self-emp-inc","Self-emp-not-inc","State-gov") )
The problem is the factor() statement. The Dataset.csv file does not have character strings surrounded by quotation marks so you get a leading space on every character field.
str(dataset)
# 'data.frame': 100 obs. of 7 variables:
# $ Age : int 39 50 38 53 28 37 49 52 31 42 ...
# $ Work_Class : chr " State-gov" " Self-emp-not-inc" " Private" NA ...
# $ Education : chr " Bachelors" " Bachelors" " HS-grad" " 11th" ...
# $ Marital_Status: chr " Never-married" " Married-civ-spouse" " Divorced" " Married-civ-spouse" ...
# $ Sex : chr " Male" " Male" " Male" " Male" ...
# $ Hours_Per_week: int 40 13 40 40 40 40 16 45 50 40 ...
# $ Income : chr " <=50K" " <=50K" " <=50K" " <=50K" ...
Notice the blank space before each label in Work_Class, Education, Marital_Status, Sex, and Income. You need to trim the white space when you read the file:
dataset = read.csv('Dataset.csv', strip.white=TRUE)
Then change the last line by removing the labels= argument:
x$Work_Class <- factor(x$Work_Class, levels = c("Federal-gov", "Local-gov", "Private", "Self-emp-inc", "Self-emp-not-inc", "State-gov"))
str(x)
# 'data.frame': 93 obs. of 7 variables:
# $ Age : num 2 1 2 3 2 2 1 2 2 3 ...
# $ Work_Class : Factor w/ 6 levels "Federal-gov",..: 6 5 3 3 3 3 5 3 3 6 ...
# $ Education : chr "Bachelors" "Bachelors" "HS-grad" "Bachelors" ...
# $ Marital_Status: chr "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
# $ Sex : chr "Male" "Male" "Male" "Female" ...
# $ Hours_Per_week: num 2 3 2 2 2 3 2 2 1 2 ...
# $ Income : chr "<=50K" "<=50K" "<=50K" "<=50K" ...
# - attr(*, "na.action")= 'omit' Named int [1:7] 4 9 28 62 70 78 93
# ..- attr(*, "names")= chr [1:7] "4" "9" "28" "62" ...
table(x$Work_Class)
#
# Federal-gov Local-gov Private Self-emp-inc Self-emp-not-inc State-gov
# 6 6 67 3 7 4
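To see why the leading space matters, here is a minimal sketch with made-up values (the vector work_class and its contents are assumptions for illustration, not taken from Dataset.csv). factor() turns any value that is not found in levels into NA, so " Private" never matches "Private":

```r
# Hypothetical values that mimic the leading space seen in str(dataset)
work_class <- c(" State-gov", " Private", " Private")
lv <- c("Private", "State-gov")

# None of the padded strings match a level, so every element becomes NA
f_bad <- factor(work_class, levels = lv)

# Trimming the white space first lets the values match the levels
f_good <- factor(trimws(work_class), levels = lv)

print(f_bad)
print(f_good)
```

If re-reading the file is not convenient, trimws() on the affected columns should have the same effect as strip.white=TRUE for this purpose.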
I have this data frame which ended up all as characters. I need to convert the Date column to a date format and the rest as numeric.
> df <- data.frame(Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
+ SD = c("11", "12", "13"),
+ SF = c("624", "625", "626"),
+ LA = c("1", "2", "3"),
+ IR = c("107", "108", "109"))
> df
Date SD SF LA IR
1 1996-01-01 11 624 1 107
2 1996-01-05 12 625 2 108
3 1996-01-29 13 626 3 109
> str(df)
'data.frame': 3 obs. of 5 variables:
$ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
$ SD : chr "11" "12" "13"
$ SF : chr "624" "625" "626"
$ LA : chr "1" "2" "3"
$ IR : chr "107" "108" "109"
I tried the following to convert only columns 2:5, but Date ended up as num with every value coerced to NA.
> df$Date <- as.Date(df$Date)
> df2 <- df
> columns <- c(1, 2:5)
> df2[ , columns] <- apply(df[ , columns], 2, function(x) as.numeric(x))
Warning message:
In FUN(newX[, i], ...) : NAs introduced by coercion
> df2
Date SD SF LA IR
1 NA 11 624 1 107
2 NA 12 625 2 108
3 NA 13 626 3 109
> str(df2)
'data.frame': 3 obs. of 5 variables:
$ Date: num NA NA NA
$ SD : num 11 12 13
$ SF : num 624 625 626
$ LA : num 1 2 3
$ IR : num 107 108 109
Any ideas where I got it wrong or any ideas how I can do this better?
Thanks in advance.
For this I would suggest using type.convert() on the whole data.frame, and then use as.Date() on the Date column.
Use the as.is = TRUE argument to ensure strings (your dates) are not converted to factors.
df <- data.frame(
Date = c("1996-01-01", "1996-01-05", "1996-01-29"),
SD = c("11", "12", "13"),
SF = c("624", "625", "626"),
LA = c("1", "2", "3"),
IR = c("107", "108", "109")
)
str(df)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
#> $ SD : chr "11" "12" "13"
#> $ SF : chr "624" "625" "626"
#> $ LA : chr "1" "2" "3"
#> $ IR : chr "107" "108" "109"
df2 <- type.convert(df, as.is = TRUE)
str(df2)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: chr "1996-01-01" "1996-01-05" "1996-01-29"
#> $ SD : int 11 12 13
#> $ SF : int 624 625 626
#> $ LA : int 1 2 3
#> $ IR : int 107 108 109
df2$Date <- as.Date(df2$Date)
str(df2)
#> 'data.frame': 3 obs. of 5 variables:
#> $ Date: Date, format: "1996-01-01" "1996-01-05" ...
#> $ SD : int 11 12 13
#> $ SF : int 624 625 626
#> $ LA : int 1 2 3
#> $ IR : int 107 108 109
Currently your logic is including all columns:
columns <- c(1, 2:5) # same as c(1:5)
But you want to exclude the first column of dates, so use this version:
columns <- c(2:5)
df2[ , columns] <- apply(df[ , columns], 2, function(x) as.numeric(x))
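As a sketch of an alternative, assuming the numeric columns are everything except Date: lapply() works column by column, whereas apply() first converts the data frame to a matrix, so with lapply() the classes of the untouched columns are preserved. The name num_cols is only illustrative.

```r
# Small version of the data frame from the question
df <- data.frame(
  Date = c("1996-01-01", "1996-01-05"),
  SD   = c("11", "12"),
  SF   = c("624", "625"),
  stringsAsFactors = FALSE
)
df$Date <- as.Date(df$Date)

# Every column except Date is assumed to hold numbers stored as text
num_cols <- setdiff(names(df), "Date")

# Convert each selected column on its own; Date is never touched
df[num_cols] <- lapply(df[num_cols], as.numeric)

str(df)
```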
I would like to show the structure of a data frame/tibble without the attributes (- attr(*, "spec")= and so on) at the end. Is there another command (or a better way) that shows only lines 2 and 3?
## spec_tbl_df [4,238 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ male : Factor w/ 2 levels "F","M": 2 1 2 1 1 1 1 1 2 2 ...
## $ age : num [1:4238] 39 46 48 61 46 43 63 45 52 43 ...
## - attr(*, "spec")=
## .. cols(
## .. male = col_factor(levels = c("0", "1"), ordered = FALSE, include_na = FALSE),
## .. age = col_double(),
## .. )
## - attr(*, "problems")=<externalptr>
We can use give.attr=FALSE. Example:
test <- `attr<-`(data.frame(), 'foo', 'bar')
str(test)
# 'data.frame': 0 obs. of 0 variables
# - attr(*, "foo")= chr "bar"
str(test, give.attr=FALSE)
# 'data.frame': 0 obs. of 0 variables
This is the function I have written:
factor_fun <- function(data1,vec){
for(i in vec){
data1[,i] = as.factor(data1[,i])
}
}
x = c(3,5,7,8,9,12,15,17,18,19,21)
factor_fun(train_data,x)
str(train_data)
This is the result
> factor_fun(train_data,x)
> str(train_data)
'data.frame': 13645 obs. of 22 variables:
$ EmpID : int 11041 15079 18638 3941 5936 9670 16554 3301 12236 10157 ...
$ EmpName : chr "John" "William" "James" "Charles" ...
$ LanguageOfCommunication: chr "English" "English" "English" "English" ...
$ Age : int 35 26 36 29 25 35 31 32 28 31 ...
$ Gender : chr "Male" "Male" "Female" "Female" ...
$ JobProfileIDApplyingFor: chr "JR85289" "JR87525" "JR87525" "JR87525" ...
$ HighestDegree : chr "B.Tech" "B.Tech" "PhD" "BCA" ...
$ DegreeBranch : chr "Electrical" "Artificial Intelligence" "Computer Science" "Information Technology" ...
$ GraduatingInstitute : chr "Tier 1" "Tier 3" "Tier 1" "Tier 2" ...
$ LatestDegreeCGPA : int 7 7 6 5 8 9 7 8 6 8 ...
$ YearsOfExperince : int 12 3 6 6 2 12 1 9 2 8 ...
$ GraduationYear : int 2009 2018 2015 2015 2019 2009 2020 2012 2019 2013 ...
$ CurrentCTC : int 21 15 15 16 24 25 12 7 21 21 ...
$ ExpectedCTC : int 26 19 24 24 32 29 21 17 28 31 ...
$ MartialStatus : chr "Married" "Married" "Single" "Married" ...
$ EmpScore : int 5 5 5 5 5 4 3 3 4 3 ...
$ CurrentDesignation : chr "SSE" "BA" "SDE" "SDE" ...
$ CurrentCompanyType : chr "Enterprise" "MidSized" "MidSized" "Startup" ...
$ DepartmentInCompany : chr "Design" "Engineering" "Engineering" "Product" ...
$ TotalLeavesTaken : int 20 6 19 16 10 10 8 18 7 10 ...
$ BiasInfluentialFactor : chr "YearsOfExperince" "" "Gender" "Gender" ...
$ FitmentPercent : num 95.4 67.1 91.3 72.3 86.3 ...
I have written a function that takes a data frame and a vector of column indices and converts the matching columns to factors. When I run it on my dataset, it isn't converting the columns to factors. Can someone help me with this? I know that we can use lapply or other functions, but it would be better if someone could explain why this isn't working. Thanks. This is my first question on Stack Overflow.
It looks like you are expecting the function to change the object that you pass to it in the calling environment. This is fundamentally not how R works: the function modifies its own local copy of data1, which is discarded when the function returns.
One workaround would be to return data1 at the end of your function and assign it when called:
factor_fun <- function(data1,vec){
for(i in vec){
data1[,i] <- as.factor(data1[,i])
}
return(data1)
}
new_df <- factor_fun(df, 1:2)
Better yet, you could skip the for loop altogether, e.g. with the dplyr package:
factor_fun <- function(data, cols) {
dplyr::mutate(data, dplyr::across(dplyr::all_of(cols), as.factor))
}
new_df <- factor_fun(df, 1:2)
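Since lapply was mentioned in the question, here is a base R sketch of the same idea that also shows the original object staying unchanged. The function name factor_cols and the tiny train data frame are made up for illustration.

```r
# Convert the selected columns to factors and return the modified copy
factor_cols <- function(data, cols) {
  data[cols] <- lapply(data[cols], as.factor)
  data
}

# Hypothetical stand-in for train_data
train <- data.frame(
  Gender = c("Male", "Female"),
  Age    = c(35L, 26L),
  stringsAsFactors = FALSE
)

converted <- factor_cols(train, 1)

is.factor(train$Gender)      # the object passed in is not modified
is.factor(converted$Gender)  # only the returned copy has the factor column
```

To update the data in place, assign the result back to the same name, for example train <- factor_cols(train, 1).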
I have two datasets: one is all numeric codes, while the other holds the conversion from codes to labels.
This is an example of the live dataset. It has more than two columns, but in this instance 191 is USA.
This is an example of the conversion dataset. Note that the columns sometimes have different lengths. For instance, Country has 200 elements, while Ethnicity has 6.
How can I write code that changes the value in CountryofBirth from the code to the label, so that 191 becomes USA? The original dataset is called my_data, while the conversion dataset is called Conversion.
I have tried left_join:
my_data <- left_join(my_data, select(Conversion, c("StateGrewUpIn", "State")), by = "StateGrewUpIn")
I have tried merge:
my_data <- merge(x = my_data, y = Conversion[ , c("StateGrewUpIn", "State")], by = "StateGrewUpIn", all.x=TRUE)
Nothing seems to work: the result always has more rows than the original dataset. In other words, rather than doing a clean vlookup, it duplicates rows or includes rows from the Conversion dataset.
Conversion Dataset
> str(Conversion)
'data.frame': 200 obs. of 20 variables:
$ Religion : int 1 2 3 4 5 6 7 8 NA NA ...
$ Rel : chr "Protestant" "Catholic" "Islam" "Judaism" ...
$ PoliticalView : int 1 2 3 4 5 6 7 NA NA NA ...
$ Political_views : chr "Very Progressive/Left-wing" "Progressive/Left-wing" "Somewhat Progressive/Left-wing" "Moderate/Centrist" ...
$ CountryofBirth : int 1 2 3 4 5 6 7 8 9 10 ...
$ Country : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ Citizenship : int 1 2 3 4 5 6 7 8 9 10 ...
$ Citizen : chr "Afghanistan" "Albania" "Algeria" "Andorra" ...
$ StateGrewUpIn : int 1 2 3 4 5 6 7 8 9 10 ...
$ State : chr "Alaska (AK)" "American Samoa (AS)" "Arizona (AZ)" "Arkansas (AR)" ...
$ Ethnicity : int 1 2 3 4 5 6 NA NA NA NA ...
$ Ethnic : chr "White" "Asian" "Latino" "Black" ...
$ Education : int 1 2 3 4 5 6 NA NA NA NA ...
$ Education_level : chr "Some high school/secondary school" "High school degree/completed secondary school" "Some university" "University degree" ...
$ YearlyIncome : int 1 2 3 4 5 6 7 NA NA NA ...
$ Income : chr "Less than $10,000 USD a year" "USD $10,000-$20,000" "USD $20,000-$40,000" "USD USD $40,000-$60,000" ...
$ HighestEducationPar : int 1 2 3 4 5 6 NA NA NA NA ...
$ Parent_Highest_Education: chr "Some high school/secondary school" "High school degree/completed secondary school" "Some university" "University degree" ...
$ Atten_check_ans_1 : int 1 2 3 4 5 NA NA NA NA NA ...
$ Attention_Check : chr "strongly disagree" "moderately disagree" "neither disagree nor agree" "moderately agree" ...
Example of large dataset (note it has >200 columns so I just took the mentioned example above).
> str(my_data)
'data.frame': 35 obs. of 228 variables:
$ Citizenship : chr "144" "191" "191" "191" ...
$ CountryofBirth : chr "144" "191" "191" "191" ...
$ StartDate : chr "2019-05-17 13:49:35" "2019-05-17 12:54:30" "2019-05-17 12:54:40" "2019-05-17 12:54:20" ...
$ EndDate : chr "2019-05-17 14:00:12" "2019-05-17 13:00:21" "2019-05-17 13:02:02" "2019-05-17 13:04:25" ...
$ Status : chr "0" "0" "0" "0" ...
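One possible approach, sketched here on miniature made-up data rather than the real files, is a match() lookup. It returns exactly one position per row of my_data, so it cannot add rows. The extra rows in the join attempts may come from the NA padding in the Conversion columns, since both left_join() and merge() match NA keys to each other by default, but that is an assumption about the data. The label "Country A" is a placeholder, and the sketch assumes each code appears only once in Conversion.

```r
# Hypothetical miniature versions of the two data frames
Conversion <- data.frame(
  CountryofBirth = c(144L, 191L, NA),       # NA mimics the padding rows
  Country        = c("Country A", "USA", NA),
  stringsAsFactors = FALSE
)
my_data <- data.frame(
  CountryofBirth = c("144", "191", "191"),  # codes stored as character
  stringsAsFactors = FALSE
)

# Position of each code in the lookup column (first match only)
idx <- match(as.integer(my_data$CountryofBirth), Conversion$CountryofBirth)

# Replace the codes with the labels found at those positions
my_data$CountryofBirth <- Conversion$Country[idx]

my_data
```

The same two lines could be repeated for each code/label pair of columns, such as Citizenship and Citizen.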
I've imported a dataset into R in which a column that should contain numeric values also contains the string NULL. This makes R set the column class to character or factor, depending on whether you use the stringsAsFactors argument.
To give you an idea, this is the structure of the dataset.
> str(data)
'data.frame': 1016 obs. of 10 variables:
$ Date : Date, format: "2014-01-01" "2014-01-01" "2014-01-01" "2014-01-01" ...
$ Name : chr "Chi" "Chi" "Chi" "Chi" ...
$ Impressions: chr "229097" "3323" "70171" "1359" ...
$ Revenue : num 533.78 11.62 346.16 3.36 1282.28 ...
$ Clicks : num 472 13 369 1 963 161 1 7 317 21 ...
$ CTR : chr "0.21" "0.39" "0.53" "0.07" ...
$ PCC : chr "32" "2" "18" "0" ...
$ PCOV : chr "3470.52" "94.97" "2176.95" "0" ...
$ PCROI : chr "6.5" "8.17" "6.29" "NULL" ...
$ Dimension : Factor w/ 11 levels "100x72","1200x627",..: 1 3 4 5 7 8 9 10 11 1 ...
I would like to convert the PCROI column to numeric, but the NULLs make this harder.
I've tried to get around the issue by setting the value to 0 for all observations where the current value is NULL, but I got the following error message:
> data$PCROI[which(data$PCROI == "NULL"), ] <- 0
Error in data$PCROI[which(data$PCROI == "NULL"), ] <- 0 :
incorrect number of subscripts on matrix
My idea was to change all the NULL observations to 0 and afterwards convert the whole column to numeric using the as.numeric function.
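You have an indexing error: data$PCROI is a vector, so it takes a single subscript with no comma:
data$PCROI[which(data$PCROI == "NULL"), ] <- 0 # will not work
data$PCROI[which(data$PCROI == "NULL")] <- 0 # will work
By the way, you can simply say:
data$PCROI = as.numeric(data$PCROI)
It will convert your "NULL" values to NA automatically (with a coercion warning), provided the column is character rather than factor.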
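Another option, sketched below on an inline made-up CSV, is to tell read.csv() at import time that "NULL" means missing via its na.strings argument, so the column arrives as numeric. The csv_text contents are an assumption for illustration. Whether the NAs should then become 0 depends on the analysis.

```r
# Hypothetical file contents with a NULL marker in a numeric column
csv_text <- "Name,PCROI\nChi,6.5\nChi,NULL\nChi,8.17"

# Treat the string NULL as NA while reading, so PCROI is parsed as numeric
data <- read.csv(text = csv_text, na.strings = "NULL")
str(data$PCROI)

# Optional: replace the missing values with 0, as in the question
data$PCROI[is.na(data$PCROI)] <- 0
```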