Do data frame levels affect exporting a dataset from R?

I have 2142 rows and 9 columns in my data frame. When I call head(df),
the data frame appears fine, something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable? Storage Unit Order Number
2209 NEZ0037-76 FreezerWorks NEZ0037 BoxPos 1 N 76
2210 NEZ0037-77 FreezerWorks NEZ0037 BoxPos 1 N 77
2211 NEZ0037-78 FreezerWorks NEZ0037 BoxPos 1 N 78
2212 NEZ0037-79 FreezerWorks NEZ0037 BoxPos 1 N 79
2213 NEZ0037-80 FreezerWorks NEZ0037 BoxPos 1 N 80
2214 NEZ0037-81 FreezerWorks NEZ0037 BoxPos 1 N 81
Description Storage.Label
2209 I4
2210 I5
2211 I6
2212 I7
2213 I8
2214 I9
However, when I call write.csv or write.table, I get an incoherent output. Something like below:
Local Identifier Local System Parent ID Storage Type Capacity Movable
1 NEZ0011 FreezerWorks NEZ0011 Box-9X9 81 Y
39 40 41 42 43 44 45
80 81 "Box-9X9 NEZ0014" 1 2 3 4
38 39 40 41 42 43 44
79 80 81 "Box-9X9 NEZ0017" 1 2 3
37 38 39 40 41 42 43
78 79 80 81 "Box-9X9 NEZ0020" 1 2
36 37 38 39 40 41 42
77 78 79 80 81 "Box-9X9 NEZ0023" 1
35 36 37 38 39 40 41
76 77 78 79 80 81 "Box-9X9 NEZ0026"
Calling sapply(df, class) reveals that every column in the data frame is [1] "factor",
except for $Storage.Level, which is [1] "data.table" "data.frame". When I call unlist on $Storage.Level, the output looks better, but the values in the column change. I also tried
df <- data.frame(df, stringsAsFactors=FALSE) without success. Neither data.frame(lapply(df, factor)), as suggested in the thread here, nor as.data.frame, as suggested in the thread here, worked. Is there a way to unlist $Storage.Level without altering the values in the column? Or is there a way to convert the column from class "data.table" "data.frame" to factor so that the data can be written out safely?
R version 3.0.3 (2014-03-06)

It sounds like you have something like this:
library(data.table)
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.table(df)
str(df)
# 'data.frame': 2 obs. of 3 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC:Classes ‘data.table’ and 'data.frame': 2 obs. of 2 variables:
# ..$ A: int 1 2
# ..$ C: int 3 4
# ..- attr(*, ".internal.selfref")=<externalptr>
sapply(df, class)
# $A
# [1] "integer"
#
# $C
# [1] "integer"
#
# $AC
# [1] "data.table" "data.frame"
If that's the case, you will have trouble writing to a csv file.
Try first calling do.call(data.frame, your_data_frame) to see if that sufficiently "flattens" your data.frame, as it does with this example.
str(do.call(data.frame, df))
# 'data.frame': 2 obs. of 4 variables:
# $ A : int 1 2
# $ C : int 3 4
# $ AC.A: int 1 2
# $ AC.C: int 3 4
You should be able to write this to a csv file without any problems.
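As a minimal sketch of the whole round trip: the example below assumes the nested column is an ordinary data.frame (used here in place of a data.table so that it runs in base R) and writes to a temporary file. The object names flat, out_file and back are only illustrative.

```r
# Toy data frame with a nested data.frame column (a stand-in for the
# data.table column in the question).
df <- data.frame(A = 1:2, C = 3:4)
df$AC <- data.frame(A = 1:2, C = 3:4)

# Flatten: each nested column becomes its own top-level column.
flat <- do.call(data.frame, df)

# Write to a temporary file and read it back to check the result.
out_file <- tempfile(fileext = ".csv")
write.csv(flat, out_file, row.names = FALSE)
back <- read.csv(out_file)
names(back)
```

If the flattened names (AC.A, AC.C) are not what you want in the file, they can be changed with names(flat) <- ... before writing.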

Related

Ordering list object of IRanges to get all elements decreasing

I am having difficulty sorting each element of a list in decreasing order.
I have a ByPos_Mindex object, i.e. a list of 1000 IRanges objects (CG_seqP), from
C <- vmatchPattern(CG, CPGi_Seq, max.mismatch = 0, with.indels = FALSE)
IRanges object with 27 ranges and 0 metadata columns:
start end width
<integer> <integer> <integer>
[1] 1 2 2
[2] 3 4 2
[3] 9 10 2
[4] 27 28 2
[5] 34 35 2
... ... ... ...
[23] 189 190 2
[24] 207 208 2
[25] 212 213 2
[26] 215 216 2
[27] 218 219 2
length(1000 of these IRanges)
I then change this to a list of only the start integers (which I want)
CG_SeqP <- sapply(C, function(x) sapply(as.vector(x), "[", 1))
[[1]]
[1] 1 3 9 27 34 47 52 56 62 66 68 70 89 110 112
[16] 136 140 146 154 160 163 178 189 207 212 215 218
(1000 of these)
The problem happens when I try to order the list elements using
CG_SeqP <- sapply(as.vector(CG_SeqP),order, decreasing = TRUE)
I get a list of what I think are row numbers, so if the first IRanges object has 27 ranges I get this...
CG_SeqP[1]
[[1]]
[1] 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8
[21] 7 6 5 4 3 2 1
So the decreasing order has worked, but not on the actual values in my list?
Any suggestions, thanks in advance.
order returns the positions of the elements in sorted order, not the elements of your vector themselves. To extract the sorted values, let us look at a toy example (I am following your idea here):
set.seed(1)
alist1 <- list(a = sample(1:100, 30))
If you print alist1 with this seed value, you will get the results below:
> alist1
$a
[1] 99 51 67 59 23 25 69 43 17 68 10 77 55 49 29 39 93 16 44
[20] 7 96 92 80 94 34 97 66 31 5 24
To sort them you can use either sort or order. sort returns the sorted data, whereas order returns the indices of the elements in the order that would sort them. It returns positions, not the actual values. Hence we need to index the original vector with those positions, using square brackets, to get the sorted result.
lapply(as.vector(alist1),function(x)x[order(x, decreasing = TRUE)])
I have used lapply instead of sapply to make sure the result is a list. You are free to choose whichever suits your needs.
This will return:
#> lapply(as.vector(alist1),function(x)x[order(x, decreasing = TRUE)])
#$a
# [1] 99 97 96 94 93 92 80 77 69 68 67 66 59 55 51 49 44 43 39
#[20] 34 31 29 25 24 23 17 16 10 7 5
I hope this clarifies your doubt. Thanks
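The difference can be seen on a very small vector. This is only a sketch with made-up numbers; the list starts is a hypothetical stand-in for the list of start positions in the question.

```r
x <- c(27, 3, 9, 1)

# order() gives positions: the 1st element is the largest, then the 3rd, ...
idx <- order(x, decreasing = TRUE)

# Indexing with those positions, or calling sort(), gives the values.
by_order <- x[idx]
by_sort <- sort(x, decreasing = TRUE)

# The same idea applied to every element of a list.
starts <- list(a = c(1, 3, 9, 27), b = c(5, 2, 8))
sorted <- lapply(starts, sort, decreasing = TRUE)
```

Because the start positions in the question were already increasing, order(..., decreasing = TRUE) simply returned 27 down to 1, which is why the result looked like reversed row numbers.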

How to combine three date columns in a data frame into a single variable?

I have a data frame that looks a bit like this:
Type Size `Jul-17` `Aug-17` `Sep-17`
1 A Large 35 24 80
2 B Medium 81 13 38
3 C Small 30 64 45
4 D Large 97 68 65
5 E Medium 31 69 33
6 F Small 84 74 12
I use the ddply function a lot, and instead of summing the three columns together like below...
result <- ddply(Example, .(Type), (summarize),
Q3sum = sum(`Jul-17`, `Aug-17`, `Sep-17`))
I'd like to be able to reference a single variable that contains those three columns and call it "Q3". Is there a way to do this that will still allow the data to work with ddply? I've tried setting the three columns to a single variable using Q3<- c(`Jul-17`, `Aug-17`, `Sep-17`), but it doesn't seem to work.
Any suggestions would be greatly appreciated.
Reproducible data frame:
read.table(check.names = FALSE, text="Type Size Jul-17 Aug-17 Sep-17
A Large 35 24 80
B Medium 81 13 38
C Small 30 64 45
D Large 97 68 65
E Medium 31 69 33
F Small 84 74 12", header=TRUE, stringsAsFactors=FALSE) -> xdf
xdf
## Type Size Jul-17 Aug-17 Sep-17
## 1 A Large 35 24 80
## 2 B Medium 81 13 38
## 3 C Small 30 64 45
## 4 D Large 97 68 65
## 5 E Medium 31 69 33
## 6 F Small 84 74 12
If you just want the sum of the columns into one Q3 column:
xdf$Q3 <- rowSums(xdf[,3:5])
xdf
## Type Size Jul-17 Aug-17 Sep-17 Q3
## 1 A Large 35 24 80 139
## 2 B Medium 81 13 38 132
## 3 C Small 30 64 45 139
## 4 D Large 97 68 65 230
## 5 E Medium 31 69 33 133
## 6 F Small 84 74 12 170
If you want the 3 months making up "Q3" nested into one column:
xdf$q3_alt <- apply(xdf, 1, function(x) { list(as.numeric(x[3:5])) })
xdf
## Type Size Jul-17 Aug-17 Sep-17 Q3 q3_alt
## 1 A Large 35 24 80 139 35, 24, 80
## 2 B Medium 81 13 38 132 81, 13, 38
## 3 C Small 30 64 45 139 30, 64, 45
## 4 D Large 97 68 65 230 97, 68, 65
## 5 E Medium 31 69 33 133 31, 69, 33
## 6 F Small 84 74 12 170 84, 74, 12
str(xdf)
## 'data.frame': 6 obs. of 7 variables:
## $ Type : chr "A" "B" "C" "D" ...
## $ Size : chr "Large" "Medium" "Small" "Large" ...
## $ Jul-17: int 35 81 30 97 31 84
## $ Aug-17: int 24 13 64 68 69 74
## $ Sep-17: int 80 38 45 65 33 12
## $ Q3 : num 139 132 139 230 133 170
## $ q3_alt:List of 6
## ..$ :List of 1
## .. ..$ : num 35 24 80
## ..$ :List of 1
## .. ..$ : num 81 13 38
## ..$ :List of 1
## .. ..$ : num 30 64 45
## ..$ :List of 1
## .. ..$ : num 97 68 65
## ..$ :List of 1
## .. ..$ : num 31 69 33
## ..$ :List of 1
## .. ..$ : num 84 74 12
The solution is the gather function from tidyr. If you also use dplyr, you can do it in one line of code.
> library(dplyr)
> library(tidyr)
> df%>%
+ gather(key = Q3,value = values,Jul_17:Sep_17)
type size Q3 values
1 1 A Large Jul_17 35
2 2 B Medium Jul_17 81
3 3 C Small Jul_17 30
4 4 D Large Jul_17 97
5 5 E Medium Jul_17 31
6 6 F Small Jul_17 84
7 1 A Large Aug_17 24
8 2 B Medium Aug_17 13
9 3 C Small Aug_17 64
10 4 D Large Aug_17 68
11 5 E Medium Aug_17 69
12 6 F Small Aug_17 74
13 1 A Large Sep_17 80
14 2 B Medium Sep_17 38
15 3 C Small Sep_17 45
16 4 D Large Sep_17 65
17 5 E Medium Sep_17 33
18 6 F Small Sep_17 12
Sounds to me like you want something along the lines of melt from the reshape2 package or gather from the tidyr package. They will make your data.frame longer, with all the Jul-17, Aug-17, and Sep-17 values in one column and another column indicating which month each data point came from.
Check out this nice primer on data tidying.
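If the goal is mainly to refer to the three month columns through one name, another option is to keep their names in a character vector. The sketch below uses only base R; q3_cols and q3_by_size are hypothetical names, and aggregate stands in for ddply so that no extra package is needed.

```r
xdf <- read.table(check.names = FALSE, header = TRUE, stringsAsFactors = FALSE,
                  text = "Type Size Jul-17 Aug-17 Sep-17
A Large 35 24 80
B Medium 81 13 38
C Small 30 64 45
D Large 97 68 65
E Medium 31 69 33
F Small 84 74 12")

# One name that stands for the three month columns.
q3_cols <- c("Jul-17", "Aug-17", "Sep-17")

# Row-wise quarter total, selected by name instead of by position.
xdf$Q3 <- rowSums(xdf[q3_cols])

# Grouped total of the quarter, here by Size.
q3_by_size <- aggregate(Q3 ~ Size, data = xdf, FUN = sum)
```

Selecting by name keeps working if columns are later added or reordered, which the positional xdf[,3:5] would not.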

Excluding rows that match a list of characters

I have a big problem: in my data frame I have people who are hypertensive but don't use medication, and people who use medication but have "normal" blood pressure.
To deal with that, I created a list of all the medications in the Brazilian Guideline of Hypertension. It worked, but it produced NA values both for people who use antihypertensive medication and for people who didn't report using any medication, so if I use complete.cases I exclude healthy people as well as sick people.
Here I import the data from an SPSS file that contains the drugs people reported in the questionnaire:
library(memisc)
setwd("C:/Users/Rafael/Documents/RStudio")
Med<- as.data.set(spss.system.file("medicamentos_fase4a_pro_saude.sav"))
Med <- Med[c(2,5)]
Med <- as.data.frame(Med)
names(Med)[names(Med) == 'quest'] <- 'Quest'
View(Med)
Medication list:
ListedMeds <- c("diuréticos", "carvedilol", "olmesartana", "tiazídicos", "clortalidona", "hidroclorotiazida", "indapamida", "bumetamida", "furosemida", "piretanida", "amilorida ", "espironolactona", "triantereno ", "antihipertensivo", "alfametildopa", "clonidina", "guanabenzo", "moxonidina", "doxazosina", "prazosina",...)
for(m in ListedMeds){ Med = Med[ !grepl(m, Med$med_rec), ] }
library(plyr) #### I use plyr because people who reported more than one medication were duplicated in the data frame, with one row per medication for the same person
Med <- ddply(Med, .(Quest), summarize, Rem = paste (med_rec, collapse = ", "))
Merging Med (the data frame with the medications and the questionnaire number) with my data frame of blood pressure results:
DFPA <- merge (DFPA, Med, by = "Quest", all = TRUE)
DFPA <- subset(DFPA, select = c(Quest, PASM, PADM, PAM, PP, CCor, CGI, Sexo, FEtária, HAS))
Excluding NA values:
DFPA <- DFPA[complete.cases(DFPA), ]
DFPA <- subset(DFPA, select = c(Quest,PASM, PADM, PAM, PP, CCor, CGI, Sexo, FEtária, HAS))
I know that the last step doesn't achieve anything useful, because I'm excluding everyone who has an NA, and that can be a healthy or a sick person. So I want to know how to exclude all the people who match the listed medications.
ps: The list "ListedMeds" contains medications from people who said they use some medication on a regular basis. In this cohort I have 4000 people; I excluded some of them based on certain parameters, leaving 2854 people. When I merge Med with DFPA, the number becomes 3011, but a lot of these people only have information in the Rem column and are NA in the other columns.
ps2: Would it be possible to create a new data frame with the people who were excluded from DFPA because they said they use antihypertensive medication? I think I could solve the problem that way, but more than 1000 people were excluded, and I think this number is wrong.
> str(DFPA)
'data.frame': 2854 obs. of 11 variables:
$ Quest : Factor w/ 3041 levels "0001","0002",..: 1 2 3 4 5 6 7 8 10 11 ...
$ PASM : num 116 128 107 112 103 122 112 99 123 120 ...
$ PADM : num 64 86 58 73 69 84 72 62 73 77 ...
$ PAM : num 81 100 74 86 80 97 85 74 90 91 ...
$ PP : num 52 42 49 39 34 38 40 37 50 43 ...
$ Age : num 60 52 53 47 44 61 54 54 33 55 ...
$ Color : Factor w/ 3 levels "B","P","PD": 1 1 1 3 3 1 3 1 1 3 ...
$ Educ : Factor w/ 3 levels "1º","2º","3º": 2 3 3 3 3 2 3 2 2 2 ...
$ Sex : Factor w/ 2 levels "F","M": 1 1 1 2 2 1 1 2 2 1 ...
$ FEtária: Ord.factor w/ 4 levels "A"<"B"<"C"<"D": 4 3 3 3 2 4 3 3 1 3 ...
$ HAS : Ord.factor w/ 4 levels "N"<"P"<"H1"<"H2": 1 2 1 1 1 2 1 1 2 2 ...
> str(Med)
List of 2
$ Quest: chr [1:2189] "2" "3" "4" "5" ...
$ Rem : chr [1:2189] "cloreto de sódio, dimenidrato, escopolamina,
fitoterápico, omeprazol, ramipril+anlodipino, sertralina" "colágeno,
dipirona, vitamina e suplemento mineral" "homeopatia" "vitamina e suplemento
mineral" ...
Sample:
> mysample
Quest PASM PADM PAM PP Age Color Educ Sex FEtária HAS
133 0133 130 84 99 46 56 PD 1º M C P
1641 1685 146 84 105 62 57 PD 1º M C H1
482 0483 122 78 93 44 64 P 2º F D P
2260 2305 118 78 91 40 54 P 3º F C N
1140 1184 114 70 85 44 63 B 2º M D N
1527 1571 168 98 121 70 56 P 2º M C H2
941 0983 116 73 87 43 65 PD 2º M D N
506 0507 134 90 105 44 60 B 3º M D P
2676 2722 100 60 73 40 50 B 3º M C N
326 0327 106 78 87 28 66 P 2º F D N
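One possible way to approach this, sketched under assumptions: instead of merging and then dropping NAs, collect the questionnaire numbers of everyone who reported a listed drug and split DFPA on that. The toy data below is hypothetical; only the column names (Quest, med_rec, PASM) and ListedMeds come from the question, and the objects pattern, flagged, excluded and kept are illustrative names.

```r
# Hypothetical toy data; column names follow the question.
Med <- data.frame(
  Quest = c("0001", "0001", "0002", "0003"),
  med_rec = c("omeprazol", "furosemida", "dipirona", "clonidina"),
  stringsAsFactors = FALSE
)
DFPA <- data.frame(
  Quest = c("0001", "0002", "0003", "0004"),
  PASM = c(116, 128, 107, 112),
  stringsAsFactors = FALSE
)
ListedMeds <- c("furosemida", "amilorida ", "clonidina")

# One regular expression matching any listed drug; trimws() drops the
# stray trailing spaces that some entries carry.
pattern <- paste(trimws(ListedMeds), collapse = "|")

# Questionnaire numbers of everyone who reported at least one listed drug.
flagged <- unique(Med$Quest[grepl(pattern, Med$med_rec)])

# Split the blood pressure data instead of merging and dropping NAs.
excluded <- DFPA[DFPA$Quest %in% flagged, ]
kept <- DFPA[!DFPA$Quest %in% flagged, ]
```

People with no medication record at all (like "0004" here) stay in kept, and excluded is the separate data frame asked for in ps2. Note also that the str output above shows Quest as "2", "3", ... in Med but as "0001", "0002", ... in DFPA; if that is the case in the real data, the two keys would need to be put in the same format before any matching or merging, which might also explain the unexpected row counts.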

How to add rownames with no dimensions in R

> Cases <- c(4,46,98,115,88,34)
> Cases
[1] 4 46 98 115 88 34
> str(Cases)
num [1:6] 4 46 98 115 88 34
I want to name the row "total.cases", but I got an error about an attempt to set rownames with no dimensions. I expect the output to be as follows:
total.cases 4 46 98 115 88 34
Your problem is that Cases, as you define it, is an atomic vector. It has no concept of rows or columns.
I think you probably want a list:
Cases <- list(total.cases = c(4,46,98,115,88,34))
Cases
## $total.cases
## [1] 4 46 98 115 88 34
str(Cases)
## List of 1
## $ total.cases: num [1:6] 4 46 98 115 88 34
Do you want to print the output in a particular way or do you actually want rownames?
To print Cases how you want, you could just use:
> cat("total.cases ",Cases,"\n")
total.cases 4 46 98 115 88 34
To assign a rowname, you need to actually have rows first. A vector (like Cases) has no row or column dimensions. You could, however, convert it to a matrix:
> matrix(Cases,nrow=1,dimnames=list("total.cases",1:length(Cases)))
1 2 3 4 5 6
total.cases 4 46 98 115 88 34
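A shorter way to get the same kind of one-row matrix, as a small sketch: rbind takes the row name from the argument name. The object name m is only illustrative.

```r
Cases <- c(4, 46, 98, 115, 88, 34)

# Naming the argument gives the single row its name.
m <- rbind(total.cases = Cases)

rownames(m)
dim(m)
```

Unlike the matrix() call above, this leaves the column names empty.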

Throw away first and last n rows

I have a data.table in R from which I want to throw away the first and the last n rows. I want to apply some filtering first and then truncate the result. I know I can do it this way:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
e2[100:(nrow(e2)-100)]
Is there a possibility of doing this in one line? I thought of something like:
example[row1%%2==0][100:-100]
This of course does not work, but is there a simpler solution that does not require an additional variable?
library(data.table)
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
n = 5
str(example[!rownames(example) %in%
c( head(rownames(example), n), tail(rownames(example), n)), ])
Classes ‘data.table’ and 'data.frame': 990 obs. of 2 variables:
$ row1: num 6 7 8 9 10 11 12 13 14 15 ...
$ row2: num 17 20 23 26 29 32 35 38 41 44 ...
- attr(*, ".internal.selfref")=<externalptr>
Added a one-liner version with the selection criterion:
str(
(res <- example[row1 %% 2 == 0])[ n:( nrow(res)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
And further added this version that does not use an intermediate named value:
str(
example[row1 %% 2 == 0][n:(sum( row1 %% 2==0)-n ), ]
)
Classes ‘data.table’ and 'data.frame': 491 obs. of 2 variables:
$ row1: num 10 12 14 16 18 20 22 24 26 28 ...
$ row2: num 29 35 41 47 53 59 65 71 77 83 ...
- attr(*, ".internal.selfref")=<externalptr>
In this case you know the name of a column that exists (row1), so length(<any column>) returns the number of rows within the unnamed temporary data.table:
example=data.table(row1=seq(1,1000,1),row2=seq(2, 3000,3))
e2=example[row1%%2==0]
ans1 = e2[100:(nrow(e2)-100)]
ans2 = example[row1%%2==0][100:(length(row1)-100)]
identical(ans1,ans2)
[1] TRUE
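Another one-line option is head and tail with a negative n, which drop rows instead of keeping them. The sketch below uses a plain data.frame so that it runs in base R; data.table provides its own head and tail methods, so the same nesting should carry over, but that part is an assumption rather than something shown here. The name trimmed is illustrative.

```r
example <- data.frame(row1 = seq(1, 1000, 1), row2 = seq(2, 3000, 3))
n <- 5

# Filter, drop the first n rows with tail(-n), then the last n with head(-n).
trimmed <- head(tail(example[example$row1 %% 2 == 0, ], -n), -n)

nrow(trimmed)
range(trimmed$row1)
```

Note that the n:(nrow - n) indexing used above keeps row n itself, so it removes n - 1 rows at the front and n at the back, whereas this version removes exactly n at each end.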
