R is not reading my csv file properly? - r

I am new to the R programming language. I am able to load this CSV file into R, but it is a semicolon-separated file with 33 attributes in total, and R reads it as a single column. The link for the dataset is
https://archive.ics.uci.edu/ml/datasets/Student+Performance#
I tried using sep(";") while reading the CSV file in R. I also tried converting the file from CSV to other formats (text, DIF), but nothing works.
Your help is appreciated.

df1 <- read.csv('student-mat.csv', sep=";")
df2 <- read.csv('student-por.csv', sep=";")
This works without any problem. Maybe you just forgot to put the equals sign between sep and ";".
> str(df1)
'data.frame': 395 obs. of 33 variables:
$ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
$ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 .
....

Using your data, it works well for me:
student_por <- read.csv2("student-por.csv")
student_mat <- read.csv2("student-mat.csv")
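To see why both answers work, here is a small sketch on a made-up semicolon-separated file (the file contents are invented for illustration, not taken from the UCI data). read.csv() assumes a comma separator unless sep is given, while read.csv2() defaults to sep = ";" and dec = ",".

```r
# Write a tiny semicolon-separated file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("school;sex;age", "GP;F;18", "GP;M;17"), path)

wrong  <- read.csv(path)             # default sep = ",": everything lands in one column
right  <- read.csv(path, sep = ";")  # explicit separator
right2 <- read.csv2(path)            # sep = ";" and dec = "," are the defaults here

ncol(wrong)
ncol(right)
ncol(right2)
```

If the file used a decimal point rather than a decimal comma in numeric columns, read.csv(..., sep = ";") would probably be the safer of the two calls.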

Related

Using lapply in group_by() of two factors in R

I have this data frame (named as OEM_final). This is the structure:
str(OEM_final)
'data.frame': 2265 obs. of 17 variables:
$ dia_hora_OEM : POSIXct, format: "2019-12-31 06:40:13" "2019-12-31 06:43:00" "2019-12-31 07:11:30" "2019-12-31 07:18:30" ...
$ coche_OEM : Factor w/ 6 levels "356232050832996",..: 3 3 3 3 3 3 3 3 6 6 ...
$ DTC_OEM_dec64: chr "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ "[{\"code\":\"B1182\",\"description\":\"Tire pressure monitor module\",\"faultInformations\":[{\"description\":\"| __truncated__ ...
$ rowname : Factor w/ 2265 levels "1","10","100",..: 1 1112 1489 1600 1711 1822 1933 2044 2155 2 ...
$ B1182 : Factor w/ 2 levels "B1182","NULL": 1 1 1 1 1 1 1 1 2 2 ...
$ B124D : Factor w/ 2 levels "B124D","NULL": 1 1 1 1 1 1 1 1 2 2 ...
$ NA. : Factor w/ 6 levels "c(NA, NA, NA, NA, NA, NA, NA, NA)",..: 3 3 3 3 3 3 3 3 1 1 ...
$ P2000 : Factor w/ 2 levels "c(\"P2000\", \"P2000\", \"P2000\")",..: 2 2 2 2 2 2 2 2 2 2 ...
$ U3003 : Factor w/ 2 levels "NULL","U3003": 1 1 1 1 1 1 1 1 1 1 ...
$ B1D01 : Factor w/ 3 levels "B1D01","c(\"B1D01\", \"B1D01\")",..: 3 3 3 3 3 3 3 3 3 3 ...
$ U0155 : Factor w/ 2 levels "NULL","U0155": 1 1 1 1 1 1 1 1 1 1 ...
$ C1B00 : Factor w/ 2 levels "C1B00","NULL": 2 2 2 2 2 2 2 2 2 2 ...
$ P037D : Factor w/ 2 levels "NULL","P037D": 1 1 1 1 1 1 1 1 1 1 ...
$ P0616 : Factor w/ 2 levels "NULL","P0616": 1 1 1 1 1 1 1 1 1 1 ...
$ P0562 : Factor w/ 2 levels "NULL","P0562": 1 1 1 1 1 1 1 1 1 1 ...
$ U0073 : Factor w/ 2 levels "NULL","U0073": 1 1 1 1 1 1 1 1 1 1 ...
$ P0138 : Factor w/ 2 levels "c(\"P0138\", \"P0138\", \"P0138\")",..: 2 2 2 2 2 2 2 2 2 2 ...
I would like to find the earliest date (dia_hora_OEM) within each group when grouping by two factors. The two factors are:
One factor, which is common to all the possible combinations, is coche_OEM.
The other factor is one of the columns from column 8 (P2000) to the last one (P0138), taken one at a time.
So, the group_by() would be:
group_by(coche_OEM, P2000)
group_by(coche_OEM, U3003)
group_by(coche_OEM, B1D01)
group_by(coche_OEM, U0155)
...
I tried different ways to accomplish this:
Using for loops:
for (DTC in c(U3003, P2000)) {
OEM_final %>%
group_by(DTC, coche_OEM) %>%
filter(dia_hora_OEM == min(dia_hora_OEM))
}
But I get an error saying:
Error in c(U3003, P2000) : object 'U3003' not found
Using lapply
In this case, I created a function:
groupCombDTC <- function(x) {
OEM_final %>%
group_by(coche_OEM, x) %>%
filter(dia_hora_OEM == min(dia_hora_OEM))
}
And then I ran lapply():
lapply(colnames(OEM_final)[8:17], groupCombDTC)
I get this error:
Error: Column `x` is unknown
Can anybody help me iterate over the different combinations using group_by()?
That's a common problem with standard evaluation in dplyr. dplyr is based on non-standard evaluation, so a column name passed as a string needs to be converted to a symbol and unquoted.
Several solutions exist. This one works well
groupCombDTC <- function(x) {
OEM_final %>%
group_by(coche_OEM, !!rlang::sym(x)) %>%
filter(dia_hora_OEM == min(dia_hora_OEM))
}
It uses rlang::sym() and !! together: rlang::sym() turns the string into a symbol, and !! unquotes it so that group_by() evaluates it as a column name.
Column names as arguments are easier to handle with data.table. If you want more elements regarding SE/NSE in dplyr and data.table, you can have a look at a blog post I wrote a few days ago
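As a rough sketch of how the function above can be combined with lapply(), here is a version run on a small made-up data frame. The toy values, the helper name earliest_by and the extra data argument are assumptions for illustration; only the column names come from the question.

```r
library(dplyr)

# Toy stand-in for OEM_final (values are invented)
toy <- data.frame(
  coche_OEM    = c("A", "A", "A", "B", "B"),
  dia_hora_OEM = as.POSIXct(c("2020-01-03 10:00:00", "2020-01-01 08:00:00",
                              "2020-01-02 09:00:00", "2020-01-05 12:00:00",
                              "2020-01-04 11:00:00"), tz = "UTC"),
  P2000 = c("P2000", "P2000", "NULL", "NULL", "NULL"),
  U3003 = c("NULL", "U3003", "U3003", "U3003", "NULL"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: keep the earliest row per (coche_OEM, <col>) group
earliest_by <- function(data, col) {
  data %>%
    group_by(coche_OEM, !!rlang::sym(col)) %>%
    filter(dia_hora_OEM == min(dia_hora_OEM)) %>%
    ungroup()
}

dtc_cols <- c("P2000", "U3003")
results  <- lapply(dtc_cols, function(col) earliest_by(toy, col))
names(results) <- dtc_cols  # one result per DTC column
```

With the real data, dtc_cols would presumably be colnames(OEM_final)[8:17], as in the question.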

Complex Counting of dataframe with factors

I have this table.
'data.frame': 5303 obs. of 9 variables:
$ Metric.ID : num 7156 7220 7220 7220 7220 ...
$ Metric.Name : Factor w/ 99 levels "Avoid accessing data by using the position and length",..: 51 59 59
$ Technical.Criterion: Factor w/ 25 levels "Architecture - Multi-Layers and Data Access",..: 4 9 9 9 9 9 9 9 9 9 ...
$ RT.Snapshot.name : Factor w/ 1 level "2017_RT12": 1 1 1 1 1 1 1 1 1 1 ...
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 2 1 2 2 2 1 1 1 1 1 ...
$ Critical.Y.N : num 0 0 0 0 0 0 0 0 0 0 ...
$ Grouping : Factor w/ 29 levels "281","Bes",..: 27 6 6 6 6 7 7 7 7 7 ...
$ Object.type : Factor w/ 11 levels "Cobol Program",..: 8 7 7 7 7 7 7 7 7 7 ...
$ Object.name : Factor w/ 3771 levels "[S:\\SOURCES\\",..: 3771 3770 3769 3768 3767 3
I want to have a statistic output like this:
For every Technical.Criterion, one row with the sum of all rows where Critical.Y.N = 0 and the sum of all rows where Critical.Y.N = 1.
So I have to combine the rows of my database to a new matrix. Using Values of the factor sums ...
But I have no idea how to start...? Any hints?
Thanks
I believe you're asking for a cross-tabulation. Because you did not provide a reproducible sample, I've used mine:
xtabs(~ Sub.Category + Category, retail)
This produces a table of counts for each combination of Sub.Category and Category.
And if you want the value to be say, based on Sales, instead of the count, then you can modify the code to:
xtabs(Sales ~ Sub.Category + Category, retail)
And you will get the following output:
EDIT based on extra information in the OP's comment
If you also want your tables to share a common title and want to change that title, you can use a combination of names() and dimnames(). The result of xtabs() is a cross-tabulation table; calling dimnames() on it returns a list of length 2, the first element corresponding to the rows and the second to the columns.
dimnames(xtabs(~ Technical.Criterion + Object.type, dat))
$Technical.Criterion
[1] "TechnicalCrit1" "TechnicalCrit2" "TechnicalCrit3"
$`Object.type`
[1] "Object.type1" "Object.type2" "Object.type3"
So given a data frame, b:
'data.frame': 3 obs. of 9 variables:
$ Metric.ID : int 101 102 103
$ Metric.Name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Technical.Criterion: Factor w/ 3 levels "TechnicalCrit1",..: 1 2 3
$ RT.Snapshot.name : Factor w/ 3 levels "A","B","C": 1 2 3
$ Violation.status : Factor w/ 2 levels "Added","Deleted": 1 2 1
$ Critical.Y.N : num 1 0 1
$ Grouping : Factor w/ 3 levels "A","B","C": 1 2 3
$ Object.type : Factor w/ 3 levels "Object.type1",..: 1 2 3
$ Object.name : Factor w/ 3 levels "A","B","C": 1 2 3
We can use xtabs and then change the "common" header right at the top of our table. Since I don't know how many levels are in b$Violation.status, I would use a generic for loop:
tab <- list()  # the list must exist before tab[[i]] is assigned
lv <- levels(b$Violation.status)
for(i in seq_along(lv)){
# assumed intent: one table per Violation.status level
tab[[i]] <- xtabs(Critical.Y.N ~ Technical.Criterion + Object.type, b[b$Violation.status == lv[i], ])
names(dimnames(tab[[i]]))[2] <- paste("Violation.status", i)
}
This produces:
                     Violation.status 1
Technical.Criterion   Object.type1 Object.type2 Object.type3
  TechnicalCrit1                 1            0            0
  TechnicalCrit2                 0            0            0
  TechnicalCrit3                 0            0            1
Which I can now use in my shiny app.
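Applied to the column names from the question, the cross-tabulation idea could look roughly like the sketch below. The data values are made up; only Technical.Criterion and Critical.Y.N are names from the original table.

```r
# Toy data with the question's column names (values are invented)
dat <- data.frame(
  Technical.Criterion = factor(c("Crit1", "Crit1", "Crit2", "Crit2", "Crit2")),
  Critical.Y.N        = c(0, 1, 0, 0, 1)
)

# One row per Technical.Criterion, one column per value of Critical.Y.N,
# each cell holding the number of matching rows
counts <- xtabs(~ Technical.Criterion + Critical.Y.N, data = dat)
counts
```

as.data.frame(counts) would turn the table into a long data frame if that shape is easier to work with.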

Compare one csv file to multiple csv files and write new csv files R

I am pretty new to loops in R, so I apologize if this question has been asked elsewhere.
Read in all 30 CSV files -> Compare File A species to the other 30 CSV files by species -> Write a new CSV file for each of the 30 files with just the matching species
File A has one column with the names of 190 species ($name). Each of the 30 other CSV files has a species column ($SBSname) with a differing number of species, ranging from 100 to 500 and including replicates (so a CSV file can have more than 190 rows). However I don't know how to write the code that ...
This is all I have at the moment ...
I have looped in all the CSV files:
`30files` = list.files(pattern="*.csv")
for (i in 1:length(`30files`)) assign(`30files`[i], read.csv(`30files`[i]))
I have code for just comparing one CSV file (branching.csv) against File A:
> str(FileA)
'data.frame': 190 obs. of 1 variable:
$ name: Factor w/ 190 levels "Acaena novae zelandiae",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(branching.csv)
'data.frame': 4055 obs. of 7 variables:
$ SBSname : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 794 2075 1049 162 132 333 541 1840 272 1553 ...
$ SBS.number : int 16443 26711 40171 40398 40867 41151 37871 42412 35847 36245 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 3 1 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 1 1 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 9 9 20 3 3 3 3 3 33 33 ...
Species<-branching.csv[(branching.csv$SBSname %in% FileA$name),]
write.csv(Species, file = "Branching.csv")
> str(Species)
'data.frame': 298 obs. of 7 variables:
$ name : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 1049 162 1548 47 57 1647 1060 2788 2094 1976 ...
$ SBS.number : int 40171 40398 36280 40532 41629 42495 40103 32792 32892 30583 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 2 2 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 2 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 3 3 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 20 3 33 33 33 33 33 44 44 44 ...
Any help or suggestions would be great. Doesn't have to be a loop!
How about this simple loop?
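library(dplyr)
# `30files` needs backticks because an R name cannot start with a digit
for(i in seq_along(`30files`))
{
# keep only the rows whose SBSname matches a name in FileA
csv.matching = read.csv(`30files`[i]) %>% inner_join(FileA, by=c("SBSname"="name"))
# write the matches next to the original, e.g. branching_matching.csv
write.csv(csv.matching, file=gsub("\\.csv$", "_matching.csv", `30files`[i]), na="")
}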
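For comparison, here is a base R sketch of the same idea that reuses the %in% filter from the question. It runs in a temporary directory on two made-up files, so the file names, the height column and all values are assumptions for illustration only.

```r
# Work in a scratch directory so nothing real is overwritten
dir <- tempfile("species_demo")
dir.create(dir)

# Hypothetical reference list (stand-in for File A)
FileA <- data.frame(name = c("Abies alba", "Acaena novae zelandiae"))

# Two hypothetical trait files, each with an SBSname column
write.csv(data.frame(SBSname   = c("Abies alba", "Abies alba", "Quercus robur"),
                     branching = c("yes", "no", "yes")),
          file.path(dir, "branching.csv"), row.names = FALSE)
write.csv(data.frame(SBSname = c("Acaena novae zelandiae", "Pinus mugo"),
                     height  = c(0.1, 3)),
          file.path(dir, "height.csv"), row.names = FALSE)

# List the input files once, before any output files are created
trait_files <- list.files(dir, pattern = "\\.csv$", full.names = TRUE)

for (f in trait_files) {
  traits  <- read.csv(f)
  # keep only the rows whose species appears in File A
  matched <- traits[traits$SBSname %in% FileA$name, , drop = FALSE]
  write.csv(matched, sub("\\.csv$", "_matching.csv", f), row.names = FALSE)
}

matched_branching <- read.csv(file.path(dir, "branching_matching.csv"))
```

Unlike inner_join(), the %in% filter never adds columns or duplicates rows, which may or may not matter depending on what File A contains.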

How to reduce a data set after subsetting it with gsub

I have a large data set, which I reduced applying gsub multiple times, basically in this form:
levels(Orders$Im) <- gsub("Osp", "OsProf", levels(Orders$Im))
I also used rbind:
DI_Reduced <- rbind(CX, OsP)
I need to apply the function tree() to the resulting data.frame, but I get an error:
tree.model <- tree(line ~ CountryCode + OrderType + Support, data=train.set)
The error is:
Error in tree(line ~ CountryCode + OrderType + Support, :
factor predictors must have at most 32 levels
Strange thing: if I export the train.set with write.csv and then I re-import it with read.csv, the error disappears and the tree is built.
I investigated the structure of the train.set and this is the difference before and after exporting/importing it:
$ CustomerNumber: Factor w/ 4616 levels "0","101959","210285",..: 3070 3069 4539 3732 2573 3086 2973 3817 4056 2956 ...
$ CountryCode : Factor w/ 4 levels "OtherCountry",..: 3 3 4 4 3 3 3 4 4 3 ...
$ OrderType : Factor w/ 5 levels "Order","NewOrder",..: 5 5 5 5 5 5 5 5 5 5 ...
$ Support : Factor w/ 5 levels "#N/A","BN",..: 4 4 4 4 2 4 4 4 4 4 ...
$ Manuf : Factor w/ 163 levels "<Generic>","6gi",..: 52 52 52 52 52 52 52 52 52 52 ...
$ line : Factor w/ 623 levels "\"Generic\" Skews",..: 400 35 400 400 400 400 400 400 400 400 ...
________________________________________________________________
$ CustomerNumber: Factor w/ 692 levels "201500","20202",..: 361 360 680 499 138 367 315 523 592 304 ...
$ CountryCode : Factor w/ 2 levels "JP","US": 1 1 2 2 1 1 1 2 2 1 ...
$ OrderType : Factor w/ 1 level "Online": 1 1 1 1 1 1 1 1 1 1 ...
$ Support : Factor w/ 4 levels "BN","MC",..: 3 3 3 3 1 3 3 3 3 3 ...
$ Manuf : Factor w/ 1 level "DY": 1 1 1 1 1 1 1 1 1 1 ...
$ line : Factor w/ 2 levels "CX","OTX": 2 1 2 2 2 2 2 2 2 2 ...
It seems to me that gsub does not really subset the original data.frame, and the hidden values stay in train.set until I export and re-import it. Is there another way to do this operation and obtain a tree?
As the error says, your dependent variable line has more than 32 levels. According to the structure of your train.set, line is a Factor w/ 623 levels.
Try using other tree libraries like rpart.
Re-creating the factors after subsetting might help, because it drops the levels that are no longer used:
train.set[] <- lapply(train.set, function(x) if (is.factor(x)) factor(x) else x)
Also, gsub is not usually used for subsetting. It is a global substitution function. You should also share the pre-processing steps you followed, so that others can help you with this better.
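The before/after structures in the question are consistent with unused factor levels surviving the subsetting and only disappearing when the CSV is re-read. A minimal sketch of that effect on made-up data, using base R's droplevels():

```r
# Toy data: a factor with four levels (values are invented)
orders <- data.frame(
  line = factor(c("CX", "OTX", "AB", "CD", "CX")),
  qty  = 1:5
)

# Subsetting rows keeps all original levels, even the unused ones
reduced <- orders[orders$line %in% c("CX", "OTX"), ]
nlevels(reduced$line)

# droplevels() removes the unused levels from every factor column
cleaned <- droplevels(reduced)
nlevels(cleaned$line)
```

Calling droplevels(train.set) before fitting the tree should therefore have a similar effect to the write.csv()/read.csv() round trip, at least as far as the factor levels are concerned.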

R factor - time series conversion not working

I have the common problem of converting a factor of the format:
"2007/01"
to a time series object. The data can be found here: http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=prc_hicp_midx&lang=en
I replaced the "M" in YYYY"M"MM with a "/".
> str(infl)
'data.frame': 3560 obs. of 5 variables:
$ TIME : Factor w/ 89 levels "2007/01","2007/02",..: 1 2 3 4 5 6 7 8 9 10 ...
$ GEO : Factor w/ 40 levels "Austria","Belgium",..: 15 15 15 15 15 15 15 15 15 15 ...
$ INFOTYPE: Factor w/ 1 level "Index, 2005=100": 1 1 1 1 1 1 1 1 1 1 ...
$ COICOP : Factor w/ 1 level "All-items HICP": 1 1 1 1 1 1 1 1 1 1 ...
$ Value : Factor w/ 1952 levels ":","100.49","100.5",..: 35 51 85 112 127 131 120 126 147 169 ...
I tried several different approaches:
as.POSIXct(as.character(infl$TIME), format = "%Y/%m")
as.POSIXlt(as.character(infl$TIME), format = "%Y/%m")
as.Date(as.character(infl$TIME), format = "%Y/%m")
However, all of them return NA for the entire length of the series. Does anyone have an idea why I cannot convert this series to a ts object?
Your help is greatly appreciated.
It looks like you can make it work using the yearmon object from the zoo package:
library(zoo)
as.yearmon("2007/01", "%Y/%m")
# [1] "Jan 2007"
See Sorting an data frame based on month-year time format for more ideas.
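A likely reason for the NA values is that as.Date() and the POSIX converters need a complete date, and "%Y/%m" has no day component. If zoo is not an option, one base R workaround is to append a dummy day first. The sketch below uses made-up index values and assumes the observations are monthly and start in January 2007, as the TIME levels in the question suggest.

```r
# Toy stand-in for infl$TIME (a factor of "YYYY/MM" strings)
time_f <- factor(c("2007/01", "2007/02", "2007/03"))

# Append a dummy day so that the string is a complete date
dates <- as.Date(paste0(as.character(time_f), "/01"), format = "%Y/%m/%d")

# Made-up index values; with real data these would come from infl$Value
values <- c(100.49, 100.5, 101.2)

# A monthly ts object only needs the start period and the frequency
series <- ts(values, start = c(2007, 1), frequency = 12)
```

Note that ts() does not use the dates vector at all; it is only useful for sorting, plotting or checking that no months are missing.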
