My data grows by one observation each quarter, and the start date differs between data sets.
I have written code which runs lots of tests, produces forecasts, and is automatically documented with graphs and tables of the data.
Everything works fine until the length of the data or the start date changes, because the data in the tables is then either the wrong length or doesn't line up with the correct quarter.
Here is an example:
Test.data <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27)
Test.dates <- c("08Q1","08Q2","08Q3","08Q4","09Q1","09Q2","09Q3","09Q4","10Q1","10Q2","10Q3","10Q4","11Q1","11Q2","11Q3","11Q4","12Q1","12Q2","12Q3","12Q4","13Q1","13Q2","13Q3","13Q4","14Q1","14Q2","14Q3")
Test <- matrix(c(Test.data,""),nrow=4,byrow=FALSE)
colnames(Test) <- c("'08","'09","'10","'11","'12","'13","'14")
rownames(Test) <- c("Qtr 1", "Qtr 2", "Qtr 3", "Qtr 4")
Which quite nicely gives:
      '08 '09 '10 '11 '12 '13 '14
Qtr 1 1   5   9   13  17  21  25
Qtr 2 2   6   10  14  18  22  26
Qtr 3 3   7   11  15  19  23  27
Qtr 4 4   8   12  16  20  24
However, in the next quarter the data will grow by one element, and the same code then gives a warning followed by an error:
Warning message:
In matrix(c(Test.data, ""), nrow = 4, byrow = FALSE) :
data length [29] is not a sub-multiple or multiple of the number of rows [4]
Error in `colnames<-`(`*tmp*`, value = c("'08", "'09", "'10", "'11", "'12", :
length of 'dimnames' [2] not equal to array extent
Or if a data set begins in 08Q2 instead of 08Q1 then the data will all be next to the wrong quarter.
I need to display my data in the specific way of:
'yr1 'yr2 'yr3 ...
Qtr 1
Qtr 2
Qtr 3
Qtr 4
Does anyone have any suggestions on how I can get this to adjust to my data automatically, without having to change anything by hand? Very soon it will be connected to a database which will constantly produce results, so the code cannot be edited each time the data has a different length.
Thank you for your help.
Please comment below if you want any more information.
Test.data.padded <- as.character(Test.data)
length(Test.data.padded) <- ceiling(length(Test.data.padded) / 4) * 4
Test.data.padded[is.na(Test.data.padded)] <- ""
Test <- matrix(Test.data.padded, nrow=4, byrow=FALSE)
#     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
#[1,] "1"  "5"  "9"  "13" "17" "21" "25"
#[2,] "2"  "6"  "10" "14" "18" "22" "26"
#[3,] "3"  "7"  "11" "15" "19" "23" "27"
#[4,] "4"  "8"  "12" "16" "20" "24" ""
Then use a regex to extract the years from your Test.dates.
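As a minimal sketch of that step, the years can be pulled out of the labels with sub() and used as column names. This assumes every label in Test.dates is a two-digit year followed by Q and the quarter number, and that the series starts in Q1; neither is checked here.

```r
Test.data  <- 1:27
Test.dates <- c(paste0(rep(sprintf("%02d", 8:13), each = 4), "Q", 1:4),
                "14Q1", "14Q2", "14Q3")

# Pad to a multiple of 4, as above
padded <- as.character(Test.data)
length(padded) <- ceiling(length(padded) / 4) * 4
padded[is.na(padded)] <- ""
Test <- matrix(padded, nrow = 4, byrow = FALSE)

# Keep only the two leading digits of each label, then drop repeats
years <- unique(sub("^([0-9]{2})Q[1-4]$", "\\1", Test.dates))

dimnames(Test) <- list(paste("Qtr", 1:4), paste0("'", years))
Test
```

Because the column names are derived from the data, nothing has to be edited when another quarter or year is appended.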
Not sure if this helps.
library(stringi)
n <- 4
l <- length(Test.data)
m1 <- stri_list2matrix(split(Test.data,as.numeric(gl(l,n,l))), fill='')
nm1 <- do.call(rbind,strsplit(Test.dates, '(?<=[0-9])(?=[Q])', perl=TRUE))
dimnames(m1) <- list(unique(nm1[,2]), unique(nm1[,1]))
m1
#   08  09  10   11   12   13   14
#Q1 "1" "5" "9"  "13" "17" "21" "25"
#Q2 "2" "6" "10" "14" "18" "22" "26"
#Q3 "3" "7" "11" "15" "19" "23" "27"
#Q4 "4" "8" "12" "16" "20" "24" ""
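Both approaches above fill the matrix column by column, so they assume the series starts in Q1. For a data set that starts in, say, 08Q2, one possible alternative is to place each value by the year and quarter in its own label. The sketch below does that with matrix indexing; quarter_table is a made-up helper name, and it assumes labels of the form "08Q2" with no duplicates.

```r
# Hypothetical helper: build the Qtr x year table from the date labels
quarter_table <- function(values, dates) {
  yr    <- sub("Q[1-4]$", "", dates)             # "08Q2" -> "08"
  qtr   <- as.integer(sub("^.*Q", "", dates))    # "08Q2" -> 2
  years <- unique(yr)

  # Start with an empty table, one column per year present in the data
  out <- matrix("", nrow = 4, ncol = length(years),
                dimnames = list(paste("Qtr", 1:4), paste0("'", years)))

  # Each value goes to (row = quarter, column = position of its year)
  out[cbind(qtr, match(yr, years))] <- as.character(values)
  out
}

# A series that starts in Q2 and ends in Q3
quarter_table(1:6, c("08Q2", "08Q3", "08Q4", "09Q1", "09Q2", "09Q3"))
```

Here the Qtr 1 cell of '08 and the Qtr 4 cell of '09 stay empty, and the table grows on its own as new quarters arrive. A year with no observations at all would not get a column in this version.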
I am currently trying to create a new matrix by looping over the old one. The thing that I want to change in the new matrix is replacing certain values with the character "recoding". Both matrices should have 10 columns and 100 rows.
In the current case, a value should be replaced if it matches one of the values in vector_A.
e.g.:
for (i in 1:10) {
  new_matrix[,i] <- old_matrix[,i]
  output_t_or_f <- is.element(new_matrix[,i],unlist(vector_A))
  if (any(output_t_or_f, na.rm = FALSE)) {
    replace(new_matrix, list = new_matrix[,i], values = "recode")
  }
}
So output_t_or_f should be TRUE or FALSE, depending on whether the element is in vector_A,
and if output_t_or_f is TRUE then the old value should be replaced with the character "recode".
Currently the new_matrix looks just like the old_matrix, so I guess there is a problem with the if statement?
Unfortunately, I can't really share my Data but I put some example data together:
If old_matrix looks like this:
> old_matrix
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
[2,]    2    7   12   17   22
[3,]    3    8   13   18   23
[4,]    4    9   14   19   24
[5,]    5   10   15   20   25
and vector_A looks like this:
> vector_A
[1] 12 27 30 42 37  9
then the new matrix should look like this:
new_matrix
     [,1] [,2]       [,3]       [,4] [,5]
[1,] "1"  "6"        "11"       "16" "21"
[2,] "2"  "7"        "recoding" "17" "22"
[3,] "3"  "8"        "13"       "18" "23"
[4,] "4"  "recoding" "14"       "19" "24"
[5,] "5"  "10"       "15"       "20" "25"
I am very new to R and can't seem to find the problem. Would appreciate any help!!
Thanks :-)
Since the replacements are the same in every column you shouldn't need a loop. Try this:
new_matrix <- old_matrix
new_matrix[new_matrix %in% vector_A] <- "recode"
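For what it is worth, the loop in the question changes nothing because replace() returns a modified copy and that result is never assigned back to new_matrix. In addition, its list argument expects index positions, not the values to be replaced. A small sketch with the example data from the question (using "recoding" as the replacement string, as in the expected output):

```r
old_matrix <- matrix(1:25, nrow = 5)
vector_A   <- c(12, 27, 30, 42, 37, 9)

# replace() does not modify its argument; the result here is thrown away
untouched <- old_matrix
replace(untouched, untouched %in% vector_A, NA)
identical(untouched, old_matrix)   # still the original matrix

# Logical indexing with assignment changes the matrix itself
new_matrix <- old_matrix
new_matrix[new_matrix %in% vector_A] <- "recoding"
new_matrix
```

Assigning a string turns the whole matrix into character, which is why every other cell is shown in quotes in the expected output.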
This question already has answers here:
How to convert a data frame column to numeric type?
(18 answers)
Closed 2 years ago.
===========================================================================
Update 2/20/2021:
I just looked into it and found that the problem is in the second file.
Sex is originally coded as "F" and "M". When I change it with:
subject.info[subject.info$Sex=='F',]$Sex=1
subject.info[subject.info$Sex=='M',]$Sex=2
the weird thing is that R silently changed 1 to "1". What is even weirder is that the values still look numeric when you print the data frame.
My question is why this happens, not how to convert the type of values in a data.frame. I don't understand why someone insists it is a duplicate question, even though similar answers can solve the problem.
=================================================================================
I have two text files. One file is .txt and the other is .csv.
The .csv file has one additional column (with NA values). All the others are the same.
When I read those files with the commands:
subject.info = read.table(paste(data_dir, "outd01_all_subject_info.txt", sep = slash), header=TRUE)
subject.info = read.csv("data_d01_features/outd01_all_subject_info2.txt", sep = ',', header=TRUE, stringsAsFactors = F)
The dataframe subject.info looks the same, but when I run:
as.matrix(subject.info)
All the data in the second file are converted to strings:
SUBJID Sex age trauma_age ptsd
[1,] "600039015048" "2" "11" NA "0"
[2,] "600110937794" "1" "10" NA "0"
[3,] "600129552715" "1" "11" " 8" "2"
[4,] "600210241146" "1" "18" "16" "2"
[5,] "600294620965" "1" "13" NA "0"
[6,] "600409285352" "2" "16" "15" "1"
[7,] "600460215379" "1" "10" NA "0"
[8,] "600547831711" "1" "10" " 6" "1"
[9,] "600561317124" "2" "19" "19" "1"
[10,] "600635899969" "2" "11" NA "0"
[11,] "600647003585" "1" "18" NA "0"
[12,] "600682103788" "1" "18" "15" "2"
[13,] "600689706588" "1" "16" "15" "2"
[14,] "600747749665" "2" " 9" " 7" "1"
This does not happen for the first file:
SUBJID Sex age ptsd
[1,] 600039015048 2 10 0
[2,] 600110937794 1 9 0
[3,] 600129552715 1 10 2
[4,] 600210241146 1 17 2
[5,] 600294620965 1 13 0
[6,] 600409285352 2 15 1
[7,] 600460215379 1 8 0
[8,] 600547831711 1 8 1
[9,] 600561317124 2 19 1
[10,] 600635899969 2 11 0
[11,] 600647003585 1 19 0
[12,] 600682103788 1 18 2
[13,] 600689706588 1 15 2
[14,] 600747749665 2 8 1
Is this due to the NA values? But when I replace NAs with 0 in the second file, the problem still exists:
SUBJID Sex age trauma_age ptsd
[1,] "600039015048" "2" "11" " 0" "0"
[2,] "600110937794" "1" "10" " 0" "0"
[3,] "600129552715" "1" "11" " 8" "2"
[4,] "600210241146" "1" "18" "16" "2"
[5,] "600294620965" "1" "13" " 0" "0"
[6,] "600409285352" "2" "16" "15" "1"
[7,] "600460215379" "1" "10" " 0" "0"
[8,] "600547831711" "1" "10" " 6" "1"
[9,] "600561317124" "2" "19" "19" "1"
[10,] "600635899969" "2" "11" " 0" "0"
[11,] "600647003585" "1" "18" " 0" "0"
[12,] "600682103788" "1" "18" "15" "2"
[13,] "600689706588" "1" "16" "15" "2"
[14,] "600747749665" "2" " 9" " 7" "1"
The problem persists if I convert the second file to a .csv file, and also if I use read.table or read.csv2.
From the output it looks like column trauma_age is of class character, which turns everything into character. Check class(subject.info$trauma_age).
Turn it into numeric by doing:
subject.info$trauma_age <- as.numeric(subject.info$trauma_age)
and then try converting to a matrix, i.e. as.matrix(subject.info).
You can also use type.convert to convert the columns to their appropriate types automatically, without having to name them.
subject.info <- type.convert(subject.info, as.is = TRUE)
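On the "why" from the update: a column is a single vector and can hold only one type, so assigning the number 1 into a character column stores it as "1". Printing a data frame does not quote strings, which is why the column still looks numeric, and as.matrix() returns a character matrix as soon as one column is character (formatting the numeric columns, hence padded values like " 8"). The sketch below reproduces this on a tiny made-up data frame; the values are illustrative only.

```r
# Made-up stand-in for subject.info
subject.info <- data.frame(SUBJID = c(101, 102, 103),
                           Sex    = c("F", "M", "F"),
                           age    = c(11, 10, 18),
                           stringsAsFactors = FALSE)

# Recoding inside a character column keeps it character
subject.info$Sex[subject.info$Sex == "F"] <- 1
subject.info$Sex[subject.info$Sex == "M"] <- 2
class(subject.info$Sex)
is.character(as.matrix(subject.info))

# Converting the column explicitly restores a numeric matrix
subject.info$Sex <- as.numeric(subject.info$Sex)
is.numeric(as.matrix(subject.info))
```

str(subject.info) is a quick way to see the real column types, since the printed data frame hides the difference.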
I have a data.frame in R whose row.names are character, and I would like them to be numeric. I found what looks like the same issue here, but the solution doesn't work for me.
Here is my code:
attr(DF1, "row.names")
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
after I do what I linked above:
DF1$id <- as.integer(row.names(DF1))
DF1[order(DF1$id), ]
I get the same result:
attr(DF1, "row.names")
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20"
and I would like the result to be the same as for data frame DF2:
attr(DF2, "row.names")
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
The help page ?rownames says (emphasis mine):
For a data frame, value for rownames should be a character vector of non-duplicated and non-missing names (this is enforced), and for colnames a character vector of (preferably) unique syntactically-valid names. In both cases, value will be coerced by as.character, and setting colnames will convert the row names to character.
You could make them an integer like this.
df <- data.frame(x = 1:3)
rownames(df) <- as.character(5:7)
attr(df, "row.names")
#> [1] "5" "6" "7"
rownames(df) <- as.integer(rownames(df))
attr(df, "row.names")
#> [1] 5 6 7
Note that row.names will always return a character vector. See ?row.names.
row.names(df)
#> [1] "5" "6" "7"
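If the row names are simply 1 to n anyway, as in the question, another option is to drop them, which makes R fall back to automatic integer row names. This is a sketch on a toy data frame, not something taken from the help page quoted above:

```r
df <- data.frame(x = 1:3)
rownames(df) <- as.character(1:3)
is.character(attr(df, "row.names"))   # TRUE: stored as strings

# Removing the row names restores the automatic 1..n integer row names
rownames(df) <- NULL
is.integer(attr(df, "row.names"))     # TRUE
```

This only makes sense when the existing names carry no information beyond the row position, because they are replaced by 1, 2, 3, and so on.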
I have a vector of n observations. Now I need to create all the possible combinations with those n elements. For example, my vector is
a<-1:4
In my output, combinations should be like,
1
2
3
4
12
13
14
23
24
34
123
124
134
234
1234
How can I get this output?
Thanks in advance.
Something like this could work:
unlist(sapply(1:4, function(x) apply(combn(1:4, x), 2, paste, collapse = '')))
First we get the combinations using combn and then we paste the outputs together. Finally, unlist gives us a vector with the output we need.
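The same idea can be wrapped in a small function so that it works for any vector rather than the hard-coded 1:4. This is only a sketch: all_combinations is a made-up name, and it relies on combn() applying its FUN argument to each combination.

```r
# Hypothetical helper: every non-empty combination of v, pasted together
all_combinations <- function(v) {
  # Note: combn() treats a single positive number n as 1:n,
  # so v should have at least two elements
  unlist(lapply(seq_along(v), function(k) {
    combn(v, k, FUN = paste, collapse = "")
  }))
}

a <- 1:4
all_combinations(a)
```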
Output:
[1] "1" "2" "3" "4" "12" "13" "14" "23" "24" "34" "123" "124" "134" "234" "1234"
I have a list named d like this:
V1 is an integer set from 0 - 50
V2 is a real set from 1500 - 1800
V3 is an integer set from 1 - 50
In total, the list contains 5100 objects
Now I would like to plot the histogram of V2, with V1 = a certain number (0, 1 or 10, etc.)
I tried different ways:
factor(d$V1)
qplot(V2, data=d, V1 = 1)         # not successful
d.subset <- subset(d, d$V1 = 1)   # not successful
This is driving me crazy. I checked the characteristics of d$V1 but found nothing strange. Could anyone help me out?
is.factor(d$V1)
[1] TRUE
str(d$V1)
Factor w/ 51 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
levels(d$V1)
[1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19"
[20] "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30" "31" "32" "33" "34" "35" "36" "37" "38"
[39] "39" "40" "41" "42" "43" "44" "45" "46" "47" "48" "49" "50" "51"
Change the line:
d.subset <- subset(d, d$V1 = 1)
to
d.subset <- subset(d, V1 == 1)
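Notice the double equals (==), which is the comparison operator. A single = is used for assignment (or for naming an argument in a call), so it does not test for equality. Inside subset() you can also refer to the column as V1 rather than d$V1.
Finally, you might want to put the 1 in quotes to make it explicit that you are selecting the "1" level of the factor (which might not be the same as the numeric 1).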
d.subset <- subset(d, V1 == "1")
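To make the factor point concrete, here is a short sketch on made-up data standing in for d (the real d is not shown in the question). It subsets on the level label, draws the histogram with base R's hist(), and shows why converting a factor with as.integer() can mislead.

```r
set.seed(1)
# Made-up data: V1 is a factor with levels "0", "1", "10"; V2 is numeric
d <- data.frame(V1 = factor(rep(c(0, 1, 10), each = 5)),
                V2 = runif(15, min = 1500, max = 1800))

# Select the rows whose V1 label is "1" and plot V2 for them
d.subset <- subset(d, V1 == "1")
hist(d.subset$V2, main = "V2 where V1 is 1", xlab = "V2")

# as.integer() on a factor gives the internal level code, not the label
as.integer(d$V1)[11]                 # 3, the position of level "10"
as.numeric(as.character(d$V1))[11]   # 10, the label itself
```

If V1 is meant to be a number, converting it once with as.numeric(as.character(d$V1)) avoids having to think about level labels in later comparisons.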