CSV conversion in R for standard calculations

I have a problem calculating the column means of a dataset imported from a CSV file.
I import the file using the following command:
dataGSR = read.csv("ShimmerData.csv", header = TRUE, sep = ",", stringsAsFactors = T)
dataGSR$X = NULL # don't need this column
Then I take a subset of it:
dati = dataGSR[4:1000, ]
and check that it looks correct:
head(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 31329 0 713 623.674691281028 2545 3706.5641025641 2409 3529.67032967033
5 31649 9.765625 713 623.674691281028 2526 3678.89230769231 2501 3664.46886446886
6 31969 19.53125 712 638.528829576655 2528 3681.80512820513 2501 3664.46886446886
7 32289 29.296875 713 623.674691281028 2516 3664.3282051282 2498 3660.07326007326
8 32609 39.0625 711 654.10779696494 2503 3645.39487179487 2496 3657.14285714286
9 32929 48.828125 713 623.674691281028 2505 3648.30769230769 2496 3657.14285714286
When I type
means = colMeans(dati)
I get:
Error in colMeans(dati) : 'x' must be numeric
To work around this I convert everything into a matrix:
datiM = data.matrix(dati)
But when I check the new variable, the values are different:
head(datiM)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
4 370 1 10 1 65 65 1 1
5 375 3707 10 1 46 46 24 24
6 381 1025 9 2 48 48 24 24
7 386 2162 10 1 36 36 21 21
8 392 3126 8 3 23 23 19 19
9 397 3229 10 1 25 25 19 19
My question is: how do I correctly convert dati so that colMeans() works?

In addition to #akrun's advice, another option is to convert the columns to numeric yourself (rather than having read.csv do it):
dati <- data.frame(
  lapply(dataGSR[-c(1:3), -9], as.numeric))
##
R> colMeans(dati)
Shimmer Shimmer.1 Shimmer.2 Shimmer.3 Shimmer.4 Shimmer.5 Shimmer.6 Shimmer.7
33004.2924 18647.4609 707.4335 718.3989 2521.3626 3672.1383 2497.9013 3659.9287
Where dataGSR was read in with stringsAsFactors=F,
dataGSR <- read.csv(
  file = "F:/temp/ShimmerData.csv",
  header = TRUE,
  stringsAsFactors = F)
Unless you know for sure that you need character columns to be factors, you are better off setting this option to FALSE.
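As an aside, the distorted values from data.matrix() are the classic factor trap: calling as.numeric() on a factor returns the internal level codes rather than the printed values. A minimal sketch of the pitfall and the safe conversions:
f <- factor(c("713", "623.674691281028", "713"))
as.numeric(f)                # 2 1 2 -- internal level codes, not the data
as.numeric(as.character(f))  # 713.0000 623.6747 713.0000 -- correct
as.numeric(levels(f))[f]     # same result, the idiom recommended in the R FAQ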

The header ("character") lines in the dataset span the first 4 lines. We can skip those 4 lines, read with header=FALSE, and then build the column names from the information in the first 4 lines.
dataGSR <- read.csv('ShimmerData.csv', header=FALSE,
                    stringsAsFactors=FALSE, skip=4)
lines <- readLines('ShimmerData.csv', n=4)
colnames(dataGSR) <- do.call(paste, c(strsplit(lines, ','),
                                      list(sep="_")))
dataGSR <- dataGSR[,-9]
unname(colMeans(dataGSR))
# [1] 33004.2924 18647.4609   707.4335   718.3989  2521.3626  3672.1383  2497.9013
# [8]  3659.9287
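Either way, before calling colMeans() it is worth confirming that every column really is numeric, for example:
sapply(dataGSR, class)
# each entry should be "numeric" (or "integer"), not "factor" or "character"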

Related

R: How to compare values in a column with later values in the same column

I am attempting to work with a large dataset in R where I need to create a column that compares the value in an existing column to all values that follow it (e.g., row 1 is compared against rows 1-10,000, row 2 against rows 2-10,000, row 3 against rows 3-10,000, etc.), but I cannot figure out how to write the range.
I currently have a column of raw numeric values and a column of row values generated by:
samples$row = seq.int(nrow(samples))
I have attempted to generate the column with the following command:
samples$processed = min(samples$raw[samples$row:10000])
but I get the error "numerical expression has 10000 elements: only the first used", and the generated column only has the value for row 1 repeated in each of the 10,000 rows.
How do I need to write this command so that the lower bound of the range is the row currently being calculated instead of 1?
Any help would be appreciated, as I have minimal programming experience.
If all you need is the min of the specific row and all following rows, then
rev(cummin(rev(samples$val)))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
If you have some other function that doesn't have a cumulative variant (and your use of min is just a placeholder), then one of:
mapply(function(a, b) min(samples$val[a:b]), seq.int(nrow(samples)), nrow(samples))
# [1] 24 24 24 24 24 24 24 24 24 24 24 24 165 165 165 165 410 410 410 882
sapply(seq.int(nrow(samples)), function(a) min(samples$val[a:nrow(samples)]))
The only reason to use mapply over sapply is if, for some reason, you want window-like operations instead of always going to the bottom of the frame. (Though if you wanted windows, I'd suggest either the zoo or slider packages.)
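For completeness, here is a windowed variant with zoo (a minimal sketch, assuming a forward-looking window of 5 rows; rollapply() and its partial argument are standard zoo features):
library(zoo)
# min of each value and the 4 that follow it; partial = TRUE shrinks
# the window near the end of the vector instead of dropping rows
rollapply(samples$val, width = 5, FUN = min, align = "left", partial = TRUE)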
Data
set.seed(42)
samples <- data.frame(val = sample(1000, size=20))
samples
# val
# 1 561
# 2 997
# 3 321
# 4 153
# 5 74
# 6 228
# 7 146
# 8 634
# 9 49
# 10 128
# 11 303
# 12 24
# 13 839
# 14 356
# 15 601
# 16 165
# 17 622
# 18 532
# 19 410
# 20 882

Writing out a .dat file in R

I have a dataset that looks like this:
ids <- c(111,12,134,14,155,16,17,18,19,20)
scores.1 <- c(0,1,0,1,1,2,0,1,1,1)
scores.2 <- c(0,0,0,1,1,1,1,1,1,0)
data <- data.frame(ids, scores.1, scores.1)
> data
ids scores.1 scores.1.1
1 111 0 0
2 12 1 1
3 134 0 0
4 14 1 1
5 155 1 1
6 16 2 2
7 17 0 0
8 18 1 1
9 19 1 1
10 20 1 1
ids stands for student ids, scores.1 is the response/score for the first question, and scores.2 is the response/score for the second question. Student ids vary in the number of digits, but scores always have 1 digit. I am trying to write this out as a .dat file by generating some objects and using them in the write.fwf() function from the gdata package.
library(gdata)
item.count <- dim(data)[2] - 1 # counts the number of questions in the dataset
write.fwf(data, file = "data.dat", width = c(5, rep(1, item.count)),
          colnames = FALSE, sep = "")
I would like to separate the student ids from the question responses with some spaces; to reserve 5 characters for the student ids I specified width = c(5, rep(1, item.count)) in write.fwf(). However, the output file ends up with the padding on the left side of the student ids,
11100
1211
13400
1411
15511
1622
1700
1811
1911
2011
rather than on the right side of the ids, which is what I want:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
Any recommendations?
Thanks!
We can use unite to combine the 'scores' columns into a single one and then write the result out:
library(dplyr)
library(tidyr)
data %>%
  unite(scores, starts_with('scores'), sep = '')
With #akrun's help, this gives what I wanted:
library(dplyr)
library(tidyr)
library(gdata)
data2 <- data %>%
  unite(scores, starts_with('scores'), sep = '')
write.fwf(data2, file = "data.dat",
          width = c(5, item.count),
          colnames = FALSE, sep = " ")
In the .dat file, the dataset now looks like this:
111 00
12 11
134 00
14 11
155 11
16 22
17 00
18 11
19 11
20 11
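If you would rather avoid gdata altogether, a minimal base-R sketch (using sprintf's "-" flag to left-justify the ids in a 5-character field) gives the same layout:
# paste all score columns together, then left-justify the ids
scores <- apply(data[-1], 1, paste, collapse = "")
writeLines(sprintf("%-5s%s", data$ids, scores), "data.dat")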

Dividing a file in R and automatically creating notepad files

I have a file which looks like this:
"1943" 359 1327 "t000000" 8
"1944" 359 907 "t000000" 8
"1946" 359 472 "t000000" 8
"1947" 359 676 "t000000" 8
"1948" 326 359 "t000000" 8
"1949" 359 585 "t000000" 8
"1950" 359 1157 "t000000" 8
"2460" 275 359 "t000000" 8
"2727" 22 556 "t000000" 8
"2730" 22 676 "t000000" 8
"479" 17 1898 "t0000000" 5
"864" 347 720 "t000s" 12
"3646" 349 691 "t000s" 7
"6377" 870 1475 "t000s" 14
"7690" 566 870 "t000s" 14
"7691" 870 2305 "t000s" 14
"8120" 870 1179 "t000s" 14
"8122" 44 870 "t000s" 14
"8124" 870 1578 "t000s" 14
"8125" 206 870 "t000s" 14
"8126" 870 1834 "t000s" 14
"6455" 1 1019 "t000t" 13
"4894" 126 691 "t00t" 9
"4896" 126 170 "t00t" 9
"560" 17 412 "t0t" 7
"130" 65 522 "tq" 18
"1034" 17 990 "tq" 10
"332" 3 138 "ts" 2
"2063" 61 383 "ts" 5
"2089" 127 147 "ts" 11
"2431" 148 472 "ts" 15
"2706" 28 43 "ts" 21
.....................
The first column is the random row number (obtained after some sorting that I needed), and the fourth column contains the pattern for which I actually want separate notepad files.
What I want is individual notepad files, named for example f1.txt, f2.txt, f3.txt, ..., each containing all the rows for one value of column 4. For example, I would get one file for "t000000", another for "t000s", a separate one for "t00t", and so on...
I did this:
list2env(split(sort, sort[,4]), envir=.GlobalEnv)
Here sort is the name of my data set and 4 is that column.
Then I could use the write.table command, but since my file is huge I would get hundreds of files like that, and calling write.table manually for each one is very tedious. Is there any way I can automate it?
Using the excellent data.table package:
library(data.table)
# get your source file
the_file <- fread('~/Desktop/file.txt') #replace with your file path
# vector of unique values of column 4 & the roots of your output filename
fl_names <- unique(the_file$V4)
# dump all the relevant subsets to files
for (f in fl_names) write.table(the_file[V4==f, ], paste0(f, '.txt'), row.names=FALSE)
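A more idiomatic data.table variant writes each group directly in j (a sketch; note that .SD excludes the grouping column, so V4 itself is dropped from each file):
# fwrite is data.table's fast writer; .BY holds the current group value
the_file[, fwrite(.SD, paste0(.BY$V4, '.txt')), by = V4]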
You've already figured out split, but instead of list2env, which will just make more work for you, use lapply:
# Generally confusing to name a data.frame
# the same as a common function!
X <- split(sort, sort[, 4])
invisible(lapply(names(X), function(y)
write.csv(X[[y]], file = paste0(y, ".csv"))))
Proof of concept:
Dir <- getwd() # Won't be necessary in your actual script
setwd(tempdir()) # I just don't want my working directory filled
list.files(pattern=".csv") # with random csv files, so I'm using tempdir()
# character(0) # Note that there are no csv files presently
X <- split(sort, sort[, 4]) # You've already figured this step out
## invisible is just so you don't have to see an empty list
## printed in your console. The rest is pretty straightforward
invisible(lapply(names(X), function(y)
write.csv(X[[y]], file = paste0(y, ".csv"))))
list.files(pattern=".csv") # Check that the files are there
# [1] "t000000.csv" "t0000000.csv" "t000s.csv" "t000t.csv"
# [5] "t00t.csv" "t0t.csv" "tq.csv" "ts.csv"
setwd(Dir) # Won't be necessary for your actual script

Iterative grep with R

OK, thank you guys; let me restate the question:
this is my df
df = read.table(text = ' replicate size fh
ms03a_T0_r1 397.51 1099
ms03a_T0_r1 695.46 8
ms03a_T0_r1 708.76 1409
ms03a_T0_r1 1203.98 102
ms03a_T0_r2 397.52 749
ms03a_T0_r2 493.97 23
ms03a_T0_r2 538.43 12
ms03a_T0_r3 397.49 638
ms03a_T0_r3 399.84 9
ms03a_T0_r3 404.95 33
ms03a_T0_r3 406.85 40 ', header = T)
Rn <- length(levels(df$replicate))
# just to calculate the number of samples
From this I would like to get 3 new datasets: one containing only the rows whose replicate value ends in _r1, one for _r2, and one for _r3.
I thought to do this with these commands:
for (i in 1:Rn){
  x <- df[sub('.*_r', '', as.character(df$replicate)) %in% i, ]
  outfile <- paste("rep_", i, "_edited.txt", sep = "")
  write.table(x, outfile, quote = FALSE, sep = ", ")
}
but this only gives me .txt outputs, not data.frame objects in R. That way I would have to import the files again into R to move on to the next step of my script, and I have no idea how to make R import them automatically.
My guess is you want this:
df = read.table(text = ' replicate size fh
ms03a_T0_r1 397.51 1099
ms03a_T0_r1 695.46 8
ms03a_T0_r1 708.76 1409
ms03a_T0_r1 1203.98 102
ms03a_T0_r2 397.52 749
ms03a_T0_r2 493.97 23
ms03a_T0_r2 538.43 12
ms03a_T0_r3 397.49 638
ms03a_T0_r3 399.84 9
ms03a_T0_r3 404.95 33
ms03a_T0_r3 406.85 40 ', header = T)
library(data.table)
dt = data.table(df)
special.ids = c(1,3)
dt[as.numeric(sub('.*_r', '', as.character(replicate))) %in% special.ids]
# replicate size fh
#1: ms03a_T0_r1 397.51 1099
#2: ms03a_T0_r1 695.46 8
#3: ms03a_T0_r1 708.76 1409
#4: ms03a_T0_r1 1203.98 102
#5: ms03a_T0_r3 397.49 638
#6: ms03a_T0_r3 399.84 9
#7: ms03a_T0_r3 404.95 33
#8: ms03a_T0_r3 406.85 40
Note, as.character is needed because read.table converts strings to factors by default. You may not need that for your actual data if it's already strings.
Upon second reading, maybe you want this?
split(df, sub('.*_r', '', as.character(df$replicate)))
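Combining the two ideas (a minimal sketch): split() keeps the subsets as data.frames in a named list, and you can still write each one to disk, so nothing needs to be re-imported:
reps <- split(df, sub('.*_r', '', as.character(df$replicate)))
for (nm in names(reps)) {
  write.table(reps[[nm]], paste0('rep_', nm, '_edited.txt'),
              quote = FALSE, sep = ', ', row.names = FALSE)
}
reps[['1']]  # the *_r1 subset is still available in R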

R Read CSV file that has timestamp

I have a CSV file that has a timestamp column stored as a string:
15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N
I use read.csv, but the problem is that once the import finishes, the date column looks like this:
1 15 1035 4530 3502 2 892 482 0 2.006011e+13 2 N
2 15 1034 7828 3501 3 263 256 0 2.007112e+13 3 N
3 15 1035 7832 4530 2 1974 1082 0 2.007112e+13 7 N
4 15 2346 8381 8155 3 2684 649 0 2.008021e+13 9 N
Is there a way to parse the date from the string as it gets read (the CSV file does have headers; they were removed here to keep the data anonymous)? If we can't parse it as it is read, what is the best way to do the parsing afterwards?
Here are 2 methods:
Using the zoo package. Personally I prefer this one, since it treats your data as a time series.
library(zoo)
read.zoo(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
index=9,tz='',format='%Y%m%d%H%M%S',sep=',')
V1 V2 V3 V4 V5 V6 V7 V8 V10 V11
2006-01-08 08:16:08 15 1035 4530 3502 2 892 482 0 2 N
2007-11-24 17:55:19 15 1034 7828 3501 3 263 256 0 3 N
2007-11-24 19:38:18 15 1035 7832 4530 2 1974 1082 0 7 N
2008-02-07 13:10:02 15 2346 8381 8155 3 2684 649 0 9 N
Using the colClasses argument in read.table, as mentioned in the comments:
dat <- read.table(text='15,1035,4530,3502,2,892,482,0,20060108081608,2,N
15,1034,7828,3501,3,263,256,0,20071124175519,3,N
15,1035,7832,4530,2,1974,1082,0,20071124193818,7,N
15,2346,8381,8155,3,2684,649,0,20080207131002,9,N',
colClasses=c(rep('numeric',8),
             'character','numeric','character'),
sep=',')
strptime(dat$V9,'%Y%m%d%H%M%S')
1] "2006-01-08 08:16:08" "2007-11-24 17:55:19"
"2007-11-24 19:38:18" "2008-02-07 13:10:02"
As Ricardo says, you can set the column classes with read.csv. In this case I recommend importing these as characters and, once the csv is loaded, converting them to dates with strptime().
for example:
test <- '20080207131002'
strptime(x = test, format = "%Y%m%d%H%M%S")
This will return a POSIXlt object with the date/time info.
You can use the lubridate package:
test <- '20080207131002'
lubridate::as_datetime(test)
You can also specify the format explicitly, depending on your needs.
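To apply this to the whole file at once, a minimal sketch (assuming the timestamp is the 9th of 11 columns; NA entries in colClasses let read.csv guess the type, and 'yourfile.csv' is a placeholder path):
dat <- read.csv('yourfile.csv',
                colClasses = c(rep(NA, 8), 'character', NA, NA))
dat[[9]] <- lubridate::ymd_hms(dat[[9]])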
