R factor - time series conversion not working - r

I have the common problem of converting a factor of the format:
"2007/01"
to time series object. The data can be found here: http://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=prc_hicp_midx&lang=en
I did replace the "M" in YYYY"M"MM with a "/".
> str(infl)
'data.frame': 3560 obs. of 5 variables:
$ TIME : Factor w/ 89 levels "2007/01","2007/02",..: 1 2 3 4 5 6 7 8 9 10 ...
$ GEO : Factor w/ 40 levels "Austria","Belgium",..: 15 15 15 15 15 15 15 15 15 15 ...
$ INFOTYPE: Factor w/ 1 level "Index, 2005=100": 1 1 1 1 1 1 1 1 1 1 ...
$ COICOP : Factor w/ 1 level "All-items HICP": 1 1 1 1 1 1 1 1 1 1 ...
$ Value : Factor w/ 1952 levels ":","100.49","100.5",..: 35 51 85 112 127 131 120 126 147 169 ...
I followed all the different approaches:
as.POSIXct(as.character(infl$TIME), format = "%Y/%m")
as.POSIXlt(as.character(infl$TIME), format = "%Y/%m")
as.Date(as.character(infl$TIME), format = "%Y/%m")
However all of them return "NA" for the entire length of the series. Has anyone any idea why I cannot convert this series to a ts object?
Your help is greatly appreciated.

It looks like you can make it work using the yearmon object from the zoo package:
library(zoo)
as.yearmon("2007/01", "%Y/%m")
# [1] "Jan 2007"
See Sorting an data frame based on month-year time format for more ideas.

Related

Compare one csv file to multiple csv files and write new csv files R

I am pretty new to loops in R so I do apologies if this question has been asked elsewhere.
Read in all 30 CSVfiles -> Compare File A species to the other 30 CSV files by species -> Write a new CSV file for each of the 30 files with just the matching species
File A has one column with the names of 190 species ($name). The 30 other csv files each have a column with the species ($SBSname) with differing number of species in the column $SBSname that can range from 100-500 with replicates (so the file CSV file can be larger than 190 rows). However I don't know how to write the code that ...
This is all I have at the moment ...
I have looped in all the CSV files:
30files = list.files(pattern="*.csv")
for (i in 1:length(30files)) assign(30files[i], read.csv(30files[i]))
I have code for just comparing one CSV file (branching.csv) against File A:
> str(FileA)
'data.frame': **190 obs. of 1 variable**:
$ name: Factor w/ 190 levels "Acaena novae zelandiae",..: 1 2 3 4 5 6 7 8 9 10 ...
> str(branching.csv)
'data.frame': **4055 obs. of 7 variables:**
$ SBSname : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 794 2075 1049 162 132 333 541 1840 272 1553 ...
$ SBS.number : int 16443 26711 40171 40398 40867 41151 37871 42412 35847 36245 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 3 1 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 2 2 1 1 1 1 1 1 1 1 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 1 1 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 9 9 20 3 3 3 3 3 33 33 ...
Species<-branching.csv[(branching.csv$SBSname %in% FileA$name),]
write.csv(Species, file = "Branching.csv")
> str(Species)
'data.frame': **298 obs. of 7 variables:**
$ name : Factor w/ 2877 levels "Abies alba","Abies nordmanniana",..: 1049 162 1548 47 57 1647 1060 2788 2094 1976 ...
$ SBS.number : int 40171 40398 36280 40532 41629 42495 40103 32792 32892 30583 ...
$ general.method : Factor w/ 5 levels "derivation from morphologies or other plant traits",..: 2 2 2 2 2 2 2 2 2 2 ...
$ branching : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 2 ...
$ valid : int 1 1 1 1 1 1 1 1 1 1 ...
$ reference : Factor w/ 6 levels "Barkman, J.J.(1988): New systems of plant growth forms and phenological plant types",..: 3 3 3 3 3 3 3 3 3 3 ...
$ original.reference: Factor w/ 97 levels "Aarssen, L.W. (1981): The biology of Canadian weeds. 50. Hypochoeris radicata L.",..: 20 3 33 33 33 33 33 44 44 44 ...
Any help or suggestions would be great. Doesn't have to be a loop!
How about this simple loop?
library(dplyr)
for(i in 1:length(30files))
{
csv.matching = read.csv(30files[i]) %>% inner_join(FileA, by=c("SBSname"="name"))
write.csv(csv.matching, file=gsub("\\.csv", "_matchin.csv", 30files[i]), na="")
}

R is not reading my csv file properly?

I am new to R Programming language. I can able to load this csv file into R. it was semicolon separated csv file. There are totally 33 attributes. but R reads it as 1 column, The link for dataset is
https://archive.ics.uci.edu/ml/datasets/Student+Performance#
I had tried using sep(";") while reading csv file in R. i had also tried to convert various formats from csv to text,dif, and nothing works.
Your help is appreciated.
df1 <- read.csv('student-mat.csv', sep=";")
df2 <- read.csv('student-por.csv', sep=";")
Works without any problem. Maybe you had just forgotten to put the equal sign between sep and ";".
> str(df1)
'data.frame': 395 obs. of 33 variables:
$ school : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
$ sex : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
$ age : int 18 17 15 15 16 16 16 17 15 15 ...
$ address : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
$ famsize : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 .
....
Using your data, it works well for me:
student_por <- read.csv2("student-por.csv")
student_mat <- read.csv2("student-mat.csv")

Automating split up of data frame

I have the following data frame in R:
> head(df)
date x y z n t
1 2012-01-01 1 1 1 0 52
2 2012-01-01 1 1 2 0 52
3 2012-01-01 1 1 3 0 52
4 2012-01-01 1 1 4 0 52
5 2012-01-01 1 1 5 0 52
6 2012-01-01 1 1 6 0 52
> str(df)
'data.frame': 4617600 obs. of 6 variables:
$ date: Date, format: "2012-01-01" "2012-01-01" "2012-01-01" "2012-01-01" ...
$ x : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ y : Factor w/ 20 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ z : Factor w/ 111 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
$ n : int 0 0 0 0 0 0 0 0 29 0 ...
$ t : num 52 52 52 52 52 52 52 52 52 52 ...
What I want to do is split this large df into smaller data frames as follows:
1) I want to have 45 data frames for each factor value of 'x'. 2) I want to further split these 45 data frames for each factor value of 'z'. So I want a total of 45*111=4995 data frames.
I've seen plenty online about splitting data frames, which turns them into lists. However, I'm not seeing how to further split lists. Another concern I have is with computer memory. If I split the data frame into lists, will it not still take up as much computer memory? If I then want to run some prediction models on the split data, it seems impossible to do. Ideally I would split the data into many data frames, run prediction models on the first split data frame, get the results I need, and then delete it before moving on to the next one.
Here's what I would do. Your data already fits in memory, so just leave it in one piece:
require(data.table)
setDT(df)
df[,{
sum(t*n) # or whatever you're doing for "prediction models"
},by=list(x,z)]

How can I strip dollar signs ($) from a data frame in R?

I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you need to only remove the $ and do not want to change the class of the columns.
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by #David
Arenburg) and convert to numeric by as.numeric
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
You can wrap this in a function
f1 <- function(dat, pat="[$]", Class="factor"){
indx <- sapply(dat, is.factor)
if(Class=="factor"){
dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
}
else {
dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
}
dat
}
f1(data)
f1(data, pat="[,$]", "numeric")
data
set.seed(24)
data <- data.frame(Year=1:6, Prog.Cost= sample(c("-$3,3333", "$0"),
6, replace=TRUE), Total.Benefits= sample(c("$2,155","$2,418",
"$2,312"), 6, replace=TRUE))
If you have to read a lot of csv files with data like this, perhaps you should consider creating your own as method to use with the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
function(from)
as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write #akrun's sample data to a csv file named "A". This would not be necessary in your actual use case where you would be reading the file directly...
## write #akrun's sample data to a csv file named "A"
set.seed(24)
data <- data.frame(
Year=1:6,
Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE),
Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
It would probably be more beneficial to just read it again, this time with readLines. I wrote akrun's data to the file "data.text" and fixed the strings before reading the table. Nor sure if the comma was a decimal point or an annoying comma, so I chose decimal point.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418

"Subset" in R does not subset the way I want it to [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
dropping factor levels in a subsetted data frame in R
I am getting a little frustrated with R here, it would be great if anyone could help me with the following:I am trying to pull a subset out of my dataset but it does not work properly.
Specifics:
I have a spreadsheet with words and different features associated with each word
e.g. word article length ...
...
Now I am trying to look at individual words, e.g. pull out all instances where the word is "hairbrush". To do so, I tried:
hairbrush=subset(dataset, word=="hairbrush")
This seems to work fine and gives me the right dataset when I look at it with fix or head. However, as soon as I try to do things like xtabs or any kind of computation, I do not get very far because all the other words are still "there" and mess up my stats. E.g. when I do levels, it gives me "hairbrush", but also all other 200 words. All the data pertaining to these "hidden words" is NA but it still messes up my stats.
Is that the usual behavior of subset? Or am I doing something wrong? Or is this the wrong approach?
Oh, and in some similar questions on Google, people always asked for the output of str, so here it is:
> str(hairbrush)
'data.frame': 41 obs. of 10 variables:
$ id : Factor w/ 1352 levels "1-1-1-11-a.eaf",..: 210 240 267 295 320 351 378 403 427 452 ...
$ speaker : num 24 25 26 28 29 30 32 33 34 35 ...
$ loc : Factor w/ 2 levels "nb","xx": 1 1 1 1 1 1 1 1 1 1 ...
$ gilbertno: Factor w/ 27 levels "1","10","108",..: 1 1 1 1 1 1 1 1 1 1 ...
$ tword : Factor w/ 65 levels "abaddream","afuneral",..: 4 4 4 4 4 4 4 4 4 4 ...
$ word : Factor w/ 228 levels "abbe","aepfel",..: 164 93 99 93 92 100 94 94 28 93 ...
$ loan : Factor w/ 5 levels "FILE","maybe",..: 4 3 5 3 5 5 3 3 3 3 ...
$ article : Factor w/ 40 levels "a","das","dat",..: 34 34 33 33 34 34 34 34 13 34 ...
$ gender : Factor w/ 13 levels "a","af","amn",..: 11 11 7 7 11 11 11 11 7 11 ...
$ comment : Factor w/ 4 levels "0","die macht ja vorschlaege",..: 1 1 1 1 1 1 1 1 1 1 ...
You need to use droplevels after subsetting to clean out unused levels.
subset is working as intended. The problem you are having is due to word being a factor. When you subset the data.frame, subset doesn't redefine your variables, so word continues to carry with it all of the level information that was part of the original dataset. Try using droplevels to drop all of the unused levels from your data.frame.

Resources