I have a matrix with 500 rows and 1000 columns. Every cell holds several comma-separated values (three in the sample below), and I need to split them at the commas.
The data looks like this:
1 2 3 4 ... 1000
1 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
2 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
3 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
.
.
500 12,1,20 14,15,12 10,10,20 1,0,10 ... 1,5,3
My code is
mat=matrix(data=NA, nrow=257, ncol=3)
n=1000
k=500
for(i in 1:n){
mat[i]<-colsplit(as.character(data[,i]), "," , c("a","b","c"))
}
This does not work; something is missing in my loop.
Could anyone help me figure it out? Thanks.
If you want to create new columns using "," as the delimiter:
library(data.table)
library(splitstackshape)
df1 <- cSplit(df, 1:ncol(df), sep=",")[,lapply(.SD, as.numeric)]
df1
# X1_1 X1_2 X1_3 X2_1 X2_2 X2_3 X3_1 X3_2 X3_3 X4_1 X4_2 X4_3
#1: 12 1 20 14 15 12 10 10 20 1 0 10
#2: 12 1 20 14 15 12 10 10 20 1 0 10
#3: 12 1 20 14 15 12 10 10 20 1 0 10
Or use cSplit_f, which should be faster for rectangular data (based on comments from the author of the splitstackshape package, @Ananda Mahto):
cSplit_f(df, 1:ncol(df), sep=",")[,lapply(.SD, as.numeric)]
str(df1)
# Classes ‘data.table’ and 'data.frame': 3 obs. of 12 variables:
# $ X1_1: num 12 12 12
# $ X1_2: num 1 1 1
# $ X1_3: num 20 20 20
# $ X2_1: num 14 14 14
# $ X2_2: num 15 15 15
# $ X2_3: num 12 12 12
# $ X3_1: num 10 10 10
# $ X3_2: num 10 10 10
# $ X3_3: num 20 20 20
# $ X4_1: num 1 1 1
# $ X4_2: num 0 0 0
# $ X4_3: num 10 10 10
data
df <- structure(list(X1 = c("12,1,20", "12,1,20", "12,1,20"), X2 = c("14,15,12",
"14,15,12", "14,15,12"), X3 = c("10,10,20", "10,10,20", "10,10,20"
), X4 = c("1,0,10", "1,0,10", "1,0,10")), .Names = c("X1", "X2",
"X3", "X4"), class = "data.frame", row.names = c("1", "2", "3"
))
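For comparison, here is a minimal base R sketch of the same split that needs no extra packages. It assumes every cell in a column holds the same number of comma-separated pieces; the helper name split_col and the small two-column df are made up for illustration and are not part of the answer above.

```r
# Small stand-in for the real data: two columns, three values per cell
df <- data.frame(X1 = c("12,1,20", "12,1,20"),
                 X2 = c("14,15,12", "14,15,12"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: split one column into a numeric matrix,
# one matrix column per comma-separated piece
split_col <- function(x) {
  parts <- do.call(rbind, strsplit(as.character(x), ",", fixed = TRUE))
  storage.mode(parts) <- "numeric"
  parts
}

# Split every column, then bind the pieces side by side
pieces <- lapply(df, split_col)
res <- as.data.frame(do.call(cbind, pieces))

# Name the new columns X1_1, X1_2, ... in the style cSplit uses
names(res) <- unlist(lapply(names(pieces), function(nm) {
  paste(nm, seq_len(ncol(pieces[[nm]])), sep = "_")
}))
res
```

The loop in the question fails for a related reason: each split produces several columns, so the result cannot be stored in a single matrix cell with mat[i].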
I did some research on this and only found information on reading in multiple CSV files.
I'm trying to create a widget that reads in a CSV file containing several data sets and prints one graph per data set.
I have been trying to think of a way to read in a CSV in which the data sets are stacked vertically. However, I won't know the length of each data set, and I won't know how many data sets are present.
Any ideas or concepts to consider would be appreciated.
# Create sample data
unlink("so-data.csv") # remove it if it exists
set.seed(1492) # reproducible
# make 3 data frames of different lengths
frames <- lapply(c(3, 10, 5), function(n) {
data.frame(X = runif(n), Y1 = runif(n), Y2= runif(n))
})
# write them to single file preserving the header
suppressWarnings(
invisible(
lapply(frames, write.table, file="so-data.csv", sep=",", quote=FALSE,
append=TRUE, row.names=FALSE)
)
)
That file looks like:
"X","Y1","Y2"
0.277646409813315,0.110495456494391,0.852662623859942
0.21606229362078,0.0521760624833405,0.510357670951635
0.184417578391731,0.00824321852996945,0.390395383816212
"X","Y1","Y2"
0.769067857181653,0.916519832098857,0.971386880846694
0.6415081594605,0.63678711745888,0.148033464793116
0.638599780155346,0.381162445060909,0.989824152784422
0.194932354846969,0.132614633999765,0.845784503268078
0.522090089507401,0.599085820373148,0.218151196138933
0.521618122234941,0.0903550288639963,0.983936473494396
0.792095972690731,0.932019826257601,0.703315682942048
0.12338977586478,0.584303047973663,0.421113619813696
0.343668724410236,0.561827397439629,0.111441049026325
0.660837838426232,0.345943035557866,0.0270762923173606
"X","Y1","Y2"
0.309987690066919,0.441982284653932,0.133840701542795
0.747786369873211,0.240106994053349,0.62044994905591
0.789473889162764,0.853503877297044,0.150850139558315
0.165826949058101,0.119402598123997,0.318282842403278
0.39083837531507,0.109747459646314,0.876092307968065
Now you can do:
# read in the data as lines
l <- readLines("so-data.csv")
# figure out where the individual data sets are
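starts <- grep("X", l)
# each set ends one line before the next header; the last one ends at the final line
# (starts[-1] also handles a file with a single data set, unlike starts[2:length(starts)])
ends <- c(starts[-1] - 1, length(l))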
# read them in
new_frames <- mapply(function(start, end) {
read.csv(text=paste0(l[start:end], collapse="\n"), header=TRUE)
}, starts, ends, SIMPLIFY=FALSE)
str(new_frames)
## List of 3
## $ :'data.frame': 3 obs. of 3 variables:
## ..$ X : num [1:3] 0.278 0.216 0.184
## ..$ Y1: num [1:3] 0.1105 0.05218 0.00824
## ..$ Y2: num [1:3] 0.853 0.51 0.39
## $ :'data.frame': 10 obs. of 3 variables:
## ..$ X : num [1:10] 0.769 0.642 0.639 0.195 0.522 ...
## ..$ Y1: num [1:10] 0.917 0.637 0.381 0.133 0.599 ...
## ..$ Y2: num [1:10] 0.971 0.148 0.99 0.846 0.218 ...
## $ :'data.frame': 5 obs. of 3 variables:
## ..$ X : num [1:5] 0.31 0.748 0.789 0.166 0.391
## ..$ Y1: num [1:5] 0.442 0.24 0.854 0.119 0.11
## ..$ Y2: num [1:5] 0.134 0.62 0.151 0.318 0.876
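A variation on the same idea, in case matching on the letter "X" is too loose for real data: treat any line identical to the first line as a header, number the blocks with cumsum, and split on that. This is only a sketch; the inline lines vector stands in for readLines() on the actual file, and it assumes every data set repeats exactly the same header line.

```r
# Stand-in for readLines("so-data.csv"): two stacked data sets
lines <- c("X,Y1,Y2",
           "0.1,0.2,0.3",
           "0.4,0.5,0.6",
           "X,Y1,Y2",
           "0.7,0.8,0.9")

# A header is any line identical to the very first line
is_header <- lines == lines[1]

# Running count of headers seen so far = data set number for each line
group <- cumsum(is_header)

# Parse each block of lines as its own CSV
frames <- lapply(split(lines, group), function(chunk) {
  read.csv(text = paste(chunk, collapse = "\n"), header = TRUE)
})
str(frames)
```

Because the grouping comes from cumsum, there is no need to compute end positions, and a file with a single data set works without a special case.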
As @Oriol Mirosa mentioned in the comments, this is one way you can do it. You can first read the whole file:
df = read.csv("path", header = TRUE)
Assuming below is how the whole csv file is structured:
df = data.frame(X=c(1:10, "X", 1:20, "X", 1:30),
Y=c(1:10, "Y", 1:20, "Y", 1:30),
Z=c(1:10, "Z", 1:20, "Z", 1:30))
df$newset = ifelse(df$X == "X", 1, 0)
df$newset = as.factor(cumsum(df$newset))
dfs = split(df, df$newset)
dfs[-1] = lapply(dfs[-1], function(x) x[-1,-ncol(x)])
dfs[[1]] = dfs[[1]][,-ncol(dfs[[1]])]
I created a binary variable newset indicating whether a row is a repeated "header" row. cumsum then turns it into a running counter, so every row of a given dataset carries the same number. I then split() on newset to get a list with one dataset per element. Finally, I dropped the leftover header row from every dataset after the first (the first header was already consumed by read.csv), along with the helper newset column. This should work no matter the length of each dataset.
Result:
# $`0`
# X Y Z
# 1 1 1 1
# 2 2 2 2
# 3 3 3 3
# 4 4 4 4
# 5 5 5 5
# 6 6 6 6
# 7 7 7 7
# 8 8 8 8
# 9 9 9 9
# 10 10 10 10
#
# $`1`
# X Y Z
# 12 1 1 1
# 13 2 2 2
# 14 3 3 3
# 15 4 4 4
# 16 5 5 5
# 17 6 6 6
# 18 7 7 7
# 19 8 8 8
# 20 9 9 9
# 21 10 10 10
# 22 11 11 11
# 23 12 12 12
# 24 13 13 13
# 25 14 14 14
# 26 15 15 15
# 27 16 16 16
# 28 17 17 17
# 29 18 18 18
# 30 19 19 19
# 31 20 20 20
#
# $`2`
# X Y Z
# 33 1 1 1
# 34 2 2 2
# 35 3 3 3
# 36 4 4 4
# 37 5 5 5
# 38 6 6 6
# 39 7 7 7
# 40 8 8 8
# 41 9 9 9
# 42 10 10 10
# 43 11 11 11
# 44 12 12 12
# 45 13 13 13
# 46 14 14 14
# 47 15 15 15
# 48 16 16 16
# 49 17 17 17
# 50 18 18 18
# 51 19 19 19
# 52 20 20 20
# 53 21 21 21
# 54 22 22 22
# 55 23 23 23
# 56 24 24 24
# 57 25 25 25
# 58 26 26 26
# 59 27 27 27
# 60 28 28 28
# 61 29 29 29
# 62 30 30 30
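One thing to watch with this approach: because the repeated header rows are read as data, every column comes in as text rather than numbers. A possible follow-up step, sketched here on a tiny made-up frame, is to run type.convert over each piece after splitting. The variable names are illustrative only.

```r
# Tiny stand-in for the file read in one go: the header row "X","Y"
# reappears as data, so both columns are character
df <- data.frame(X = c(1:3, "X", 1:2),
                 Y = c(4:6, "Y", 7:8),
                 stringsAsFactors = FALSE)

# Same grouping idea as above: count header rows seen so far
grp <- cumsum(df$X == "X")
dfs <- split(df, grp)

# Drop the leftover header row from every piece after the first
dfs[-1] <- lapply(dfs[-1], function(d) d[-1, , drop = FALSE])

# Convert the text columns back to numbers and reset the row names
dfs <- lapply(dfs, function(d) {
  d[] <- lapply(d, type.convert, as.is = TRUE)
  rownames(d) <- NULL
  d
})
str(dfs)
```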
I am attempting to apply an IF statement through a large list of 64 items. My data takes the following form:
file_list Large list (64 elements, 4.2 Mb)
file1: 'data.frame': 3012 obs. of 4 variables:
..$V1: int[1:3012] 1850 1850 1850 ...
..$V2: int[1:3012] 1 2 3 ...
..$V3: int[1:3012] 16 15 16 ...
..$V4: int[1:3012] 4.69E-05 6.99E-05 5.62E-05 ...
................................................................................
file64: 'data.frame': 5412 obs. of 4 variables:
..$V1: int[1:5412] 1850 1850 1850 ...
..$V2: int[1:5412] 1 2 3 ...
..$V3: int[1:5412] 16 15 16 ...
..$V4: int[1:5412] 6.96E-05 4.99E-05 5.37E-05 ...
What I want to do is multiply the fourth column ($V4) in each of the 64 files by a different number depending on the contents of the second column ($V2). The numbers in $V2 are months of the year, and I need to multiply $V4 by 31 when $V2 is 1, 3, 5, 7, 8, 10 and 12; 30 when $V2 is 4, 6, 9 and 11; and 28.25 when $V2 is 2.
I assume this will involve some sort of for loop, but I haven't been able to complete this task. Any suggestions?
Here's a reproducible solution that uses a small function:
file_list <- list(file1 = data.frame(v1 = sample(1:100, 100, TRUE),
v2 = sample(c(1,2,3,5,6,8,10,4,6,9,11), 100, TRUE),
v4 = rnorm(100)),
file2 = data.frame(v1 = sample(1:100, 100, TRUE),
v2 = sample(c(1,2,3,5,6,8,10,4,6,9,11), 100, TRUE),
v4 = rnorm(100)))
str(file_list)
# List of 2
# $ file1:'data.frame': 100 obs. of 3 variables:
# ..$ v1: int [1:100] 6 90 66 86 32 33 50 46 19 59 ...
# ..$ v2: num [1:100] 5 10 2 10 8 6 10 3 5 5 ...
# ..$ v4: num [1:100] -0.639 -2.234 -0.816 0.997 -0.302 ...
# $ file2:'data.frame': 100 obs. of 3 variables:
# ..$ v1: int [1:100] 34 25 24 4 100 59 80 100 21 97 ...
# ..$ v2: num [1:100] 3 6 8 8 9 1 8 1 3 3 ...
# ..$ v4: num [1:100] -2.2599 0.0548 -1.1666 -0.4049 0.4681 ...
myFun <- function(df) {
df$v4[df$v2 %in% c(1,3,5,7,8,10,12)] <- df$v4[df$v2 %in% c(1,3,5,7,8,10,12)] * 31
df$v4[df$v2 %in% c(4,6,9,11)] <- df$v4[df$v2 %in% c(4,6,9,11)] * 30
df$v4[df$v2 == 2] <- df$v4[df$v2 == 2] * 28.25
df
}
lapply(file_list, myFun)
# lapply(file_list, FUN = function(x) head(myFun(x)))
# $file1
# v1 v2 v4
# 1 6 5 -19.816836
# 2 90 10 -69.264329
# 3 66 2 -23.054110
# 4 86 10 30.910798
# 5 32 8 -9.347289
# 6 33 6 -16.316746
#
# $file2
# v1 v2 v4
# 1 34 3 -70.055942
# 2 25 6 1.642744
# 3 24 8 -36.165864
# 4 4 8 -12.550877
# 5 100 9 14.041857
# 6 59 1 -2.556662
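Since the month number can be used directly as an index, the three assignments can also be collapsed into a single lookup vector. A sketch, using the multipliers from the question (including 28.25 for February) and the same v2/v4 column names as the example above; days_in_month and scale_by_month are names I made up:

```r
# Multiplier for each month, indexed by month number 1..12
days_in_month <- c(31, 28.25, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)

# Multiply v4 by the entry that matches the month in v2
scale_by_month <- function(df) {
  df$v4 <- df$v4 * days_in_month[df$v2]
  df
}

# Two small example frames standing in for the 64 files
file_list <- list(
  file1 = data.frame(v2 = c(1, 2, 4), v4 = c(1, 1, 1)),
  file2 = data.frame(v2 = c(12, 9), v4 = c(2, 2))
)

scaled <- lapply(file_list, scale_by_month)
scaled
```

This relies on v2 holding whole numbers from 1 to 12; any other value would produce NA in v4.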
I have a data frame (labels) that I would like to use as a reference or lookup table of the form:
V1 V2
1 1 WALKING
2 2 WALKING_UPSTAIRS
3 3 WALKING_DOWNSTAIRS
4 4 SITTING
5 5 STANDING
6 6 LAYING
The data frame that should use the reference table is test (ncol = 564, nrow = 2947). Its first three column names are test_subject, test_label (numeric, 1-6) and data_set, where the values 1-6 in test_label correspond to the strings listed above.
Could someone help me figure out how to use my lookup table to insert a new column called "activity_label", where each observation holds the string that the reference table gives for that row's number?
E.g., if test_label in row 1 equals 5, then activity_label in row 1 would equal "STANDING".
Thanks so much for all of your help!
After using the merge method:
> test2[1:10, 564: 565]
angle(Z,gravityMean) activity_label
1 0.04404283 walking
2 0.04134032 walking
3 0.04295217 walking
4 0.03611571 walking
5 -0.09080307 walking
6 -0.08602478 walking
7 -0.07997668 walking
8 0.04372663 walking
9 0.19900166 walking
10 0.20350821 walking
analyzing structure of the remaining dfs
> str(test1)
'data.frame': 2947 obs. of 565 variables:
$ test_labels : int 1 1 1 1 1 1 1 1 1 1 ...
$ test_subject : int 12 12 12 12 4 4 4 12 9 9 ...
$ observ_set : Factor w/ 1 level "test": 1 1 1 1 1 1 1 1 1 1 ...
$ tBodyAcc-mean()-X : num 0.228 0.303 0.237 0.306 0.29 ...
> str(train1)
'data.frame': 7352 obs. of 565 variables:
$ train_labels : int 1 1 1 1 1 1 1 1 1 1 ...
$ V1 : int 27 7 7 26 7 26 6 6 6 7 ...
$ observ_set : Factor w/ 1 level "train": 1 1 1 1 1 1 1 1 1 1 ...
$ tBodyAcc-mean()-X : num 0.262 0.354 0.344 0.292 0.314 ...
One way is to use nested ifelse() calls.
Assuming the data frame is test and the activity number column is activitynum:
test$activitylabel <- ifelse(test$activitynum == 1, "walking", ifelse(test$activitynum == 2, "walking_upstairs", ifelse(test$activitynum == 3, "walking_downstairs", ifelse(test$activitynum == 4, "sitting", ifelse(test$activitynum == 5, "standing", ifelse(test$activitynum == 6, "laying", NA))))))
Another way is to create a lookup table and then do a merge, as suggested by @Jaehyeon:
lookup <- data.frame(activitynum = c(1,2,3,4,5,6), activity = c("walking", "walking_upstairs", "walking_downstairs", "sitting", "standing", "laying"))
survey <- data.frame(id = 1:10, activitynum = floor(runif(10, 1, 7)), var1 = runif(10, 1, 100))
merge(survey, lookup, by = "activitynum", all.x = TRUE)
> str(lookup)
'data.frame': 6 obs. of 2 variables:
$ activitynum: num 1 2 3 4 5 6
$ activity : Factor w/ 6 levels "laying","sitting",..: 4 6 5 2 3 1
> str(survey)
'data.frame': 10 obs. of 3 variables:
$ id : int 1 2 3 4 5 6 7 8 9 10
$ activitynum: num 1 2 4 1 4 6 2 4 2 2
$ var1 : num 52.3 60.5 53.3 49.8 73.1 ...
I'd do it as follows. The mapping is between 'test_label' in the data and 'id' in the lookup table, and the two are joined using merge(). If you want to keep every row of df, including rows with no match in the lookup table, use all.x = TRUE. Otherwise leave it out.
set.seed(1237)
lookup <- data.frame(id = 1:6, activity = LETTERS[1:6])
df <- data.frame(test_label = sample(1:6, 10, replace = TRUE))
merge(df, lookup, by.x = "test_label", by.y = "id", all.x = TRUE)
test_label activity
1 1 A
2 1 A
3 2 B
4 2 B
5 3 C
6 5 E
7 5 E
8 6 F
9 6 F
10 6 F
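Note that merge() sorts the result by the key by default, which is why the first rows after merging can all show the same activity. If the original row order matters, one alternative is to index the lookup table with match(). A small sketch using the labels table from the question and a made-up three-row test frame:

```r
# Lookup table in the shape shown in the question
labels <- data.frame(
  V1 = 1:6,
  V2 = c("WALKING", "WALKING_UPSTAIRS", "WALKING_DOWNSTAIRS",
         "SITTING", "STANDING", "LAYING"),
  stringsAsFactors = FALSE
)

# Made-up stand-in for the real 2947-row test data
test <- data.frame(test_subject = c(12, 12, 4),
                   test_label = c(5, 1, 6))

# match() gives the row of labels for each test_label, in test's own order
test$activity_label <- labels$V2[match(test$test_label, labels$V1)]
test
```

Rows whose test_label has no entry in labels get NA, which mirrors what all.x = TRUE does in merge().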
I'm quite new to R and am battling a bit with what would appear to be an extremely simple query.
I've imported a csv file into R using read.csv and am trying to remove the dollar signs ($) prior to tidying the data and further analysis (the dollar signs are playing havoc with charting).
I've been trying without luck to strip the $ using dplyr and gsub from the data frame and I'd really appreciate some advice about how to go about it.
My data frame looks like this:
> str(data)
'data.frame': 50 obs. of 17 variables:
$ Year : int 1 2 3 4 5 6 7 8 9 10 ...
$ Prog.Cost : Factor w/ 2 levels "-$3,333","$0": 1 2 2 2 2 2 2 2 2 2 ...
$ Total.Benefits : Factor w/ 44 levels "$2,155","$2,418",..: 25 5 7 11 12 10 9 14 13 8 ...
$ Net.Cash.Flow : Factor w/ 45 levels "-$2,825","$2,155",..: 1 6 8 12 13 11 10 15 14 9 ...
$ Participant : Factor w/ 46 levels "$0","$109","$123",..: 1 1 1 45 46 2 3 4 5 6 ...
$ Taxpayer : Factor w/ 48 levels "$113","$114",..: 19 32 35 37 38 40 41 45 48 47 ...
$ Others : Factor w/ 47 levels "-$9","$1,026",..: 12 25 26 24 23 11 9 10 8 7 ...
$ Indirect : Factor w/ 42 levels "-$1,626","-$2",..: 1 6 10 18 22 24 28 33 36 35 ...
$ Crime : Factor w/ 35 levels "$0","$1","$10",..: 6 11 13 19 21 23 28 31 33 32 ...
$ Child.Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Education : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Health.Care : Factor w/ 38 levels "-$10","-$11",..: 7 7 7 7 2 8 12 36 30 9 ...
$ Welfare : Factor w/ 1 level "$0": 1 1 1 1 1 1 1 1 1 1 ...
$ Earnings : Factor w/ 41 levels "$0","$101","$104",..: 1 1 1 22 23 24 25 26 27 28 ...
$ State.Benefits : Factor w/ 37 levels "$102","$117",..: 37 1 3 4 6 10 12 18 24 27 ...
$ Local.Benefits : Factor w/ 24 levels "$115","$136",..: 24 1 2 12 14 16 19 22 23 21 ...
$ Federal.Benefits: Factor w/ 39 levels "$0","$100","$102",..: 1 1 1 12 12 17 20 19 19 21 ...
If you only need to remove the $ and do not want to change the class of the columns:
indx <- sapply(data, is.factor)
data[indx] <- lapply(data[indx], function(x)
as.factor(gsub("\\$", "", x)))
If you need numeric columns, you can strip out the , as well (contributed by @David Arenburg) and convert with as.numeric:
data[indx] <- lapply(data[indx], function(x) as.numeric(gsub("[,$]", "", x)))
You can wrap this in a function
f1 <- function(dat, pat="[$]", Class="factor"){
indx <- sapply(dat, is.factor)
if(Class=="factor"){
dat[indx] <- lapply(dat[indx], function(x) as.factor(gsub(pat, "", x)))
}
else {
dat[indx] <- lapply(dat[indx], function(x) as.numeric(gsub(pat, "", x)))
}
dat
}
f1(data)
f1(data, pat="[,$]", "numeric")
data
set.seed(24)
data <- data.frame(Year=1:6, Prog.Cost= sample(c("-$3,3333", "$0"),
6, replace=TRUE), Total.Benefits= sample(c("$2,155","$2,418",
"$2,312"), 6, replace=TRUE))
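One caveat for newer R versions: since R 4.0.0, read.csv and data.frame no longer convert strings to factors by default, so an index built with is.factor may select nothing. A sketch that handles both character and factor columns follows; the is_money test (every value starts with an optional minus sign and a $) is my own assumption about what counts as a currency column.

```r
# Example data with character columns, as read.csv returns them in R >= 4.0
data <- data.frame(Year = 1:3,
                   Prog.Cost = c("-$3,333", "$0", "$0"),
                   Total.Benefits = c("$2,155", "$2,418", "$2,312"),
                   stringsAsFactors = FALSE)

# Flag text columns (character or factor) whose values all look like "$..." or "-$..."
is_money <- vapply(data, function(x) {
  (is.character(x) || is.factor(x)) && all(grepl("^-?\\$", x))
}, logical(1))

# Strip "$" and "," and convert the flagged columns to numeric
data[is_money] <- lapply(data[is_money], function(x) {
  as.numeric(gsub("[$,]", "", x))
})
str(data)
```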
If you have to read a lot of csv files with data like this, you could define your own coercion method with setAs() and use it through the colClasses argument, like this:
setClass("dollar")
setAs("character", "dollar",
function(from)
as.numeric(gsub("[,$]", "", from, fixed = FALSE)))
Before demonstrating how to use it, let's write @akrun's sample data to a temporary csv file whose path is stored in A. This would not be necessary in your actual use case, where you would read the file directly.
## write @akrun's sample data to a temporary csv file; A holds its path
set.seed(24)
data <- data.frame(
Year=1:6,
Prog.Cost= sample(c("-$3,3333", "$0"), 6, replace = TRUE),
Total.Benefits = sample(c("$2,155","$2,418","$2,312"), 6, replace=TRUE))
A <- tempfile()
write.csv(data, A, row.names = FALSE)
Now, you have a new option for colClasses that can be used with read.csv :-)
read.csv(A, colClasses = c("numeric", "dollar", "dollar"))
# Year Prog.Cost Total.Benefits
# 1 1 -33333 2155
# 2 2 -33333 2312
# 3 3 0 2312
# 4 4 0 2155
# 5 5 0 2418
# 6 6 0 2418
It would probably be easier to just read the file again, this time with readLines. I wrote akrun's data to the file "data.txt" and fixed the strings before reading the table. I was not sure whether the comma was a decimal mark or a thousands separator, so I treated it as a decimal mark.
r <- gsub("[$]", "", readLines("data.txt"))
read.table(text = r, dec = ",")
# Year Prog.Cost Total.Benefits
# 1 1 -3.3333 2.155
# 2 2 -3.3333 2.312
# 3 3 0.0000 2.312
# 4 4 0.0000 2.155
# 5 5 0.0000 2.418
# 6 6 0.0000 2.418
Compare the behavior of data.table and data.frame below:
library(data.table)
a.matrix <- matrix(seq_len(25),ncol = 5, nrow = 5)
a.list <- list(seq_len(5),a.matrix)
a.dt <- as.data.table(a.list)
a.df <- as.data.frame(a.list)
a.dt.df <- as.data.table(a.df)
str(a.dt)
str(a.df)
str(a.dt.df)
data.table flattens the matrix into a single column of length 25 and recycles the shorter vector to match:
> str(a.dt)
Classes ‘data.table’ and 'data.frame': 25 obs. of 2 variables:
$ V1: int 1 2 3 4 5 1 2 3 4 5 ...
$ V2: int 1 2 3 4 5 6 7 8 9 10 ...
- attr(*, ".internal.selfref")=<externalptr>
On the other hand, data.frame breaks each column out:
> str(a.df)
'data.frame': 5 obs. of 6 variables:
$ X1.5: int 1 2 3 4 5
$ X1 : int 1 2 3 4 5
$ X2 : int 6 7 8 9 10
$ X3 : int 11 12 13 14 15
$ X4 : int 16 17 18 19 20
$ X5 : int 21 22 23 24 25
My current workaround to get this behavior quickly with as.data.table is to pass the list through both coercion functions, as.data.frame first and then as.data.table:
> str(a.dt.df)
Classes ‘data.table’ and 'data.frame': 5 obs. of 6 variables:
$ X1.5: int 1 2 3 4 5
$ X1 : int 1 2 3 4 5
$ X2 : int 6 7 8 9 10
$ X3 : int 11 12 13 14 15
$ X4 : int 16 17 18 19 20
$ X5 : int 21 22 23 24 25
- attr(*, ".internal.selfref")=<externalptr>
Why is there a difference, and is there a fast way to get the data.frame behavior with data.table?
Just to close this on the SO end: as mentioned in the comments, this is now being handled as a bug/issue on GitHub, and as of this writing it has been added to the data.table v1.9.8 milestone.
Follow-up
This is now resolved as per commit 64f377...