Read fourth column of CSV files and combine into one file in R

I have a large set of csv files, all with the same format. I need to loop through all of them, take the column "Median" (4th column) and write it to a new file where they would all be combined together.
They have the format below.
Wind_Speed Average Median Power_Curve Difference
1 0.0 NaN NA 0 NaN
2 0.5 NaN NA 0 NaN
3 1.0 NaN NA 0 NaN
4 1.5 NaN NA 0 NaN
5 2.0 NaN NA 0 NaN
6 2.5 14.12 14.12 24 -9.9
7 3.0 31.02 31.51 48 -17.0
8 3.5 55.06 57.12 96 -40.9
9 4.0 106.70 109.89 192 -85.3
10 4.5 178.13 180.76 288 -109.9
11 5.0 277.68 278.57 408 -130.3
12 5.5 401.91 400.41 540 -138.1
13 6.0 568.38 569.73 696 -127.6
14 6.5 765.16 762.98 912 -146.8
15 7.0 999.09 1002.82 1104 -104.9
16 7.5 1222.77 1216.91 1332 -109.2
17 8.0 1460.55 1463.50 1524 -63.4
18 8.5 1601.32 1597.00 1656 -54.7
19 9.0 1658.94 1664.40 1680 -21.1
20 9.5 1662.15 1667.81 1692 -29.9
21 10.0 1661.49 1665.47 1692 -30.5
22 10.5 1659.75 1663.02 1692 -32.2
23 11.0 1660.59 1661.13 1692 -31.4
24 11.5 1660.18 1659.44 1692 -31.8
25 12.0 1662.33 1666.21 1692 -29.7
26 12.5 1661.55 1661.10 1692 -30.5
27 13.0 1667.06 1677.50 1692 -24.9
28 13.5 1660.06 1661.63 1692 -31.9
29 14.0 1671.95 1686.82 1692 -20.0
30 14.5 1675.67 1687.73 1692 -16.3
31 15.0 1672.57 1685.97 1692 -19.4
32 15.5 1666.96 1673.73 1692 -25.0
33 16.0 1670.11 1681.58 1692 -21.9
34 16.5 1669.24 1686.14 1692 -22.8
35 17.0 1669.85 1677.95 1692 -22.1
36 17.5 1656.20 1644.46 1692 -35.8
37 18.0 1687.57 1687.57 1692 -4.4
38 18.5 1691.64 1691.69 1692 -0.4
39 19.0 1681.02 1686.78 1692 -11.0
40 19.5 1689.79 1689.79 1692 -2.2
41 20.0 NaN NA 1692 NaN
Ideally the new column name in the new file would be the old file name.
I know it's going to work something like the code below, but I don't know how to write the column into the next column of a new table and keep going for each ii.
files2 <- list.files(path="~/test2",pattern="*.csv", full.names=TRUE, recursive=FALSE)
for(ii in files2){
titlename<- tools::file_path_sans_ext(basename(files2))
mydata2 <-read.csv(ii, header = T, stringsAsFactors=FALSE)
mydata2<- mydata2[,4]
???
}

setwd("~/test2") # set path to where the files are (here, the folder from the question)
csv_files <- list.files(pattern = "\\.csv$") # list csv files in path
temp <- list() # empty list to collect one column per file
for (i in csv_files) {
  temp[[i]] <- read.csv(i)[[4]] # 4 is the column you want to select; change as needed
}
names(temp) <- stringr::str_remove(names(temp), "\\.csv$") # use this line if you want to remove .csv from the column names in the combined csv
write.csv(as.data.frame(temp), "combined.csv", row.names = FALSE) # write the combined csv once, after the loop
This seems to work for me.

Alternative approach with base R and lapply:
files <- list.files(path = "~/path", pattern = "\\.csv$")
Custom function to read a csv, pull the filename and assign it as the column name.
(The path is pasted into read.csv because you can occasionally get path errors in these loops.)
read_files_assign_filename <- function(filename){
  item <- read.csv(paste("~/path", filename, sep = "/"), header = TRUE)[4]
  colnames(item) <- substr(filename, 1, nchar(filename) - 4) # remove ".csv"
  item # return item
}
Wrap in lapply and cbind to put the columns together into one data frame.
final_result <- do.call(cbind, lapply(files, read_files_assign_filename))
Hope that helps/works!
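For anyone who wants to try the idea end to end without touching real data, here is a minimal sketch. The folder, the file names (turbine_a.csv, turbine_b.csv) and the numbers are made up for illustration, and it assumes every file has the same number of rows. It selects the column by name ("Median") instead of by position, which should be a bit more robust if the files were saved with or without row names.

```r
# Hypothetical demo folder with two small CSV files (not the asker's data).
demo_dir <- file.path(tempdir(), "median_demo")
dir.create(demo_dir, showWarnings = FALSE)

make_file <- function(name, medians) {
  df <- data.frame(Wind_Speed = c(2.5, 3.0, 3.5),
                   Average = medians + 0.5,
                   Median = medians,
                   Power_Curve = c(24, 48, 96))
  # row.names = TRUE writes the row numbers as the first column,
  # which is what makes "Median" the 4th column when read back.
  write.csv(df, file.path(demo_dir, name), row.names = TRUE)
}
make_file("turbine_a.csv", c(14.12, 31.51, 57.12))
make_file("turbine_b.csv", c(15.00, 30.00, 60.00))

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file and keep only the "Median" column as a plain vector.
medians <- lapply(files, function(f) {
  read.csv(f, stringsAsFactors = FALSE)[["Median"]]
})
# Name each column after its source file, without the extension.
names(medians) <- tools::file_path_sans_ext(basename(files))

combined <- as.data.frame(medians)
# Write the result outside the input folder so a rerun does not pick it up.
write.csv(combined, file.path(tempdir(), "combined_medians.csv"),
          row.names = FALSE)
print(combined)
```

Writing the combined file somewhere other than the input folder avoids it being read back in as one of the inputs on the next run.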

Related

How to sample data non-random

I have a weather dataset, and my data is date-dependent.
I want to predict the temperature from 07 May 2008 until 18 May 2008 (maybe 10-15 observations in total); my data size is around 200.
I will be using decision tree/RF, SVM and NN to make my prediction.
I've never handled data like this, so I'm not sure how to sample non-random data.
I want to split the data into 80% train data and 20% test data, but I want to sample the data in the original order, not randomly. Is that possible?
install.packages("rattle")
install.packages("RGtk2")
library("rattle")
seed <- 42
set.seed(seed)
fname <- system.file("csv", "weather.csv", package = "rattle")
dataset <- read.csv(fname, encoding = "UTF-8")
dataset <- dataset[1:200,]
dataset <- dataset[order(dataset$Date),]
set.seed(321)
sample_data = sample(nrow(dataset), nrow(dataset)*.8)
test<-dataset[sample_data,] # the randomly sampled 80% of the rows
train<-dataset[-sample_data,] # the remaining 20%
output
> head(dataset)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
1 2007-11-01 Canberra 8.0 24.3 0.0 3.4 6.3 NW 30
2 2007-11-02 Canberra 14.0 26.9 3.6 4.4 9.7 ENE 39
3 2007-11-03 Canberra 13.7 23.4 3.6 5.8 3.3 NW 85
4 2007-11-04 Canberra 13.3 15.5 39.8 7.2 9.1 NW 54
5 2007-11-05 Canberra 7.6 16.1 2.8 5.6 10.6 SSE 50
6 2007-11-06 Canberra 6.2 16.9 0.0 5.8 8.2 SE 44
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
1 SW NW 6 20 68 29 1019.7
2 E W 4 17 80 36 1012.4
3 N NNE 6 6 82 69 1009.5
4 WNW W 30 24 62 56 1005.5
5 SSE ESE 20 28 68 49 1018.3
6 SE E 20 24 70 57 1023.8
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
1 1015.0 7 7 14.4 23.6 No 3.6 Yes
2 1008.4 5 3 17.5 25.7 Yes 3.6 Yes
3 1007.2 8 7 15.4 20.2 Yes 39.8 Yes
4 1007.0 2 7 13.5 14.1 Yes 2.8 Yes
5 1018.5 7 7 11.1 15.4 Yes 0.0 No
6 1021.7 7 5 10.9 14.8 No 0.2 No
> head(test)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
182 2008-04-30 Canberra -1.8 14.8 0.0 1.4 7.0 N 28
77 2008-01-16 Canberra 17.9 33.2 0.0 10.4 8.4 N 59
88 2008-01-27 Canberra 13.2 31.3 0.0 6.6 11.6 WSW 46
58 2007-12-28 Canberra 15.1 28.3 14.4 8.8 13.2 NNW 28
96 2008-02-04 Canberra 18.2 22.6 1.8 8.0 0.0 ENE 33
126 2008-03-05 Canberra 12.0 27.6 0.0 6.0 11.0 E 46
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
182 E N 2 19 80 40 1024.2
77 N NNE 15 20 58 62 1008.5
88 N WNW 4 26 71 28 1013.1
58 NNW NW 6 13 73 44 1016.8
96 SSE ENE 7 13 92 76 1014.4
126 SSE WSW 7 6 69 35 1025.5
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
182 1020.5 1 7 5.3 13.9 No 0.0 No
77 1006.1 6 7 24.5 23.5 No 4.8 Yes
88 1009.5 1 4 19.7 30.7 No 0.0 No
58 1013.4 1 5 18.3 27.4 Yes 0.0 No
96 1011.5 8 8 18.5 22.1 Yes 9.0 Yes
126 1022.2 1 1 15.7 26.2 No 0.0 No
> head(train)
Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed
7 2007-11-07 Canberra 6.1 18.2 0.2 4.2 8.4 SE 43
9 2007-11-09 Canberra 8.8 19.5 0.0 4.0 4.1 S 48
11 2007-11-11 Canberra 9.1 25.2 0.0 4.2 11.9 N 30
16 2007-11-16 Canberra 12.4 32.1 0.0 8.4 11.1 E 46
22 2007-11-22 Canberra 16.4 19.4 0.4 9.2 0.0 E 26
25 2007-11-25 Canberra 15.4 28.4 0.0 4.4 8.1 ENE 33
WindDir9am WindDir3pm WindSpeed9am WindSpeed3pm Humidity9am Humidity3pm Pressure9am
7 SE ESE 19 26 63 47 1024.6
9 E ENE 19 17 70 48 1026.1
11 SE NW 6 9 74 34 1024.4
16 SE WSW 7 9 70 22 1017.9
22 ENE E 6 11 88 72 1010.7
25 SSE NE 9 15 85 31 1022.4
Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RISK_MM RainTomorrow
7 1022.2 4 6 12.4 17.3 No 0.0 No
9 1022.7 7 7 14.1 18.9 No 16.2 Yes
11 1021.1 1 2 14.6 24.0 No 0.2 No
16 1012.8 0 3 19.1 30.7 No 0.0 No
22 1008.9 8 8 16.5 18.3 No 25.8 Yes
25 1018.6 8 2 16.8 27.3 No 0.0 No
I use mtcars as an example. One option to split your data non-randomly into train and test is to first compute the training sample size from the number of rows in your data. After that you can use split to cut the data exactly at the 80% mark. You can use the following code:
smp_size <- floor(0.80 * nrow(mtcars))
split <- split(mtcars, rep(1:2, times = c(smp_size, nrow(mtcars) - smp_size)))
With the following code you can turn the split into train and test:
train <- split$`1`
test <- split$`2`
Let's check the number of rows:
> nrow(train)
[1] 25
> nrow(test)
[1] 7
Now the data is split in train and test without losing their order.
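The same ordered split can also be sketched with head and tail on a date-ordered data frame. The data below is a hypothetical stand-in for the weather data (200 daily rows with a made-up MaxTemp column), so treat it as an illustration of the idea, not as the asker's actual setup.

```r
# Hypothetical date-ordered data standing in for the weather data set.
weather <- data.frame(
  Date = as.Date("2007-11-01") + 0:199,
  MaxTemp = round(20 + 10 * sin((0:199) / 30), 1)
)
weather <- weather[order(weather$Date), ]  # make sure rows are in date order

n_train <- floor(0.80 * nrow(weather))

train <- head(weather, n_train)                  # earliest 80% of the dates
test  <- tail(weather, nrow(weather) - n_train)  # most recent 20%

range(train$Date)
range(test$Date)
```

Because the rows are never shuffled, every date in test comes after every date in train, which is usually what you want when the goal is to predict later observations from earlier ones.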

Problem in Converting column's value of data frame in R-language

As a newbie in R, I am trying to convert a column's values to numeric in my data frame.
My Code:
hData <- read.csv("outcome-of-care-measures.csv")
temp <- subset(hData, hData$State == "LA")
as.numeric(temp$Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia)
When I paste this
(temp$Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia)
code into the console without conversion, it returns:
[1] 12.5 12.7 13.0 9.6 9.8 11.5 9.5 11.2 10.2 11.1 9.7 9.6 9.3 11.0 13.5 9.3 10.6 9.9 12.0
[20] 11.3 12.6 13.1 10.7 11.2 9.9 9.7 9.2 14.2 10.6 10.8 10.1 12.6 12.7 7.4 12.9 10.1 12.9 11.0
[39] 11.7 10.6 8.4 11.7 11.0 10.8 12.6
After conversion it returns:
[1] 26 28 31 118 120 16 117 13 3 12 119 118 115 11 36 115 7 121 21 14 27 32 8 13
[25] 121 119 114 43 7 9 2 27 28 97 30 2 30 11 18 7 106 18 11 9 27
Please help me: what am I missing?
Since your data is stored as factors, it is a little complicated to convert them into numeric. R FAQ states:
How do I convert factors to numeric?
It may happen that when reading numeric data into R (usually, when reading in a file), they come in as factors. If f is such a factor object, you can use
as.numeric(as.character(f))
to get the numbers back. More efficient, but harder to remember, is
as.numeric(levels(f))[as.integer(f)]
In any case, do not call as.numeric() or their likes directly for the task at hand (as as.numeric() or unclass() give the internal codes).
Try this; it returns the same values that are printed before the conversion:
as.numeric(levels(temp$Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia))[temp$Hospital.30.Day.Death..Mortality..Rates.from.Pneumonia]
You can check out the reason in the R FAQ as well.
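To see the difference on something small, here is a sketch with a made-up factor (the values are hypothetical, not taken from the hospital file). It compares calling as.numeric() directly with the two conversions quoted from the R FAQ.

```r
# Hypothetical factor holding numeric-looking labels, similar to what
# read.csv can produce when a column is read in as a factor.
f <- factor(c("12.5", "9.6", "13.0", "9.6"))

# Calling as.numeric() directly returns the internal level codes.
codes <- as.numeric(f)

# Going through the labels returns the actual values.
values <- as.numeric(as.character(f))

# Same values, but each distinct level is converted only once.
values_fast <- as.numeric(levels(f))[as.integer(f)]

print(codes)
print(values)
print(values_fast)
```

If the column also contains non-numeric labels (for example a text marker for missing data), both conversions turn those entries into NA with a warning, which is worth checking for before doing any arithmetic.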

How to change a column classed as NULL to class integer?

So I'm starting with a dataframe called max.mins that has 153 rows.
day Tx Hx Tn
1 1 10.0 7.83 2.1
2 2 7.7 6.19 2.5
3 3 7.1 4.86 0.0
4 4 9.8 7.37 2.7
5 5 13.4 12.68 0.4
6 6 17.5 17.47 3.5
7 7 16.5 15.58 6.5
8 8 21.5 20.30 6.2
9 9 21.7 21.41 9.7
10 10 24.4 28.18 8.0
I'm applying these statements to the dataframe to look for specific criteria
temp_warnings <- subset(max.mins, Tx >= 32 & Tn >=20)
humidex_warnings <- subset(max.mins, Hx >= 40)
Now when I open up humidex_warnings for example I have this dataframe
row.names day Tx Hx Tn
1 41 10 31.1 40.51 20.7
2 56 25 33.4 42.53 19.6
3 72 11 34.1 40.78 18.1
4 73 12 33.8 40.18 18.8
5 74 13 34.1 41.10 22.4
6 79 18 30.3 41.57 22.5
7 94 2 31.4 40.81 20.3
8 96 4 30.7 40.39 20.2
The next step is to search for 2 or 3 consecutive numbers in the column row.names and give me a total of how many times this occurs (I asked this in a previous question and have a function that should work once this problem is sorted out). The issue is that row.names is class NULL, which is preventing me from applying further functions to this dataframe.
Help? :)
Thanks in advance,
Nick
If you need the row names as integer data:
humidex_warnings$seq <- as.integer(row.names(humidex_warnings))
If you don't need the row names:
row.names(humidex_warnings) <- NULL
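A small sketch of the first option, using a few of the rows shown in the question. The row.names heading in the viewer is presumably not a real column, which would explain why its class comes back as NULL; copying the row names into an integer column gives you something to work with. The run counting at the end is only one assumed way to look for consecutive numbers, not the function from the previous question.

```r
# A few rows from the question, rebuilt by hand with their row names.
humidex_warnings <- data.frame(
  day = c(10, 25, 11, 12, 13),
  Tx  = c(31.1, 33.4, 34.1, 33.8, 34.1),
  Hx  = c(40.51, 42.53, 40.78, 40.18, 41.10),
  Tn  = c(20.7, 19.6, 18.1, 18.8, 22.4),
  row.names = c("41", "56", "72", "73", "74")
)

# There is no column called row.names, so this is NULL.
is.null(humidex_warnings$row.names)

# Row names are character strings; store them as an integer column.
humidex_warnings$seq <- as.integer(row.names(humidex_warnings))

# Group consecutive row numbers: a step other than 1 starts a new run.
run_id <- cumsum(c(TRUE, diff(humidex_warnings$seq) != 1))
run_lengths <- as.vector(table(run_id))
print(run_lengths)
```

Here rows 72, 73 and 74 form one run of three consecutive numbers, while 41 and 56 each stand alone.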

R program - getting particular values depending on another column

So I have data regarding Id number and time
Id number Time(hr)
1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4
I want this output
Time Id number
10 5
20 10
30 16
40 22
So I want the time to be in 10 hour intervals and get the ID that corresponds to that particular hour. I decided to use this code: data <- data2[seq(0, nrow(data2), by=5), ] but instead of the Time being in 10 hr intervals, the ID number is at 10 intervals, which is not the output I want. So far I'm getting this output:
Id.number Time..s.
10 19.3
20 36.9
You can use the %% (mod) operator.
data[data$Time %% 10 == 0, ]
I use cut() and cumsum(table()) but I don't quite get the answer you are expecting. How exactly are you calculating this?
# first load the data
v.txt <- '1 5
2 6.1
3 7.2
4 8.3
5 9.6
6 10.9
7 13
8 15.1
9 17.2
10 19.3
11 21.4
12 23.5
13 25.6
14 27.1
15 28.6
16 30.1
17 31.8
18 33.5
19 35.2
20 36.9
21 38.6
22 40.3
23 42
24 43.7
25 45.4'
# load in the data... awkwardly...
v <- as.data.frame(matrix(as.numeric(unlist(strsplit(strsplit(v.txt, '\n')[[1]], ' +'))), byrow=T, ncol=2))
names(v) <- c('Id', 'Time') # name the columns so that v$Time exists
tens <- seq(from=0, by=10, to=50)
v$cut <- cut(v$Time, tens, labels=tens[-1])
v2 <- as.data.frame(cumsum(table(v$cut)))
names(v2) <- 'Id' # cumulative count, i.e. the last Id at or before each 10 hr mark
v2$Time <- rownames(v2)
rownames(v2) <- 1:nrow(v2)
v2 <- v2[,c(2,1)]
rm(v, v.txt, tens) # not needed anymore
v2 # the answer... but doesn't quite match your expected answer...
Time Id
1 10 5
2 20 10
3 30 15
4 40 21
5 50 25
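One reading that does reproduce the expected output is "for each 10 hour mark, take the Id whose Time is closest to it". That rule is a guess from the numbers in the question, not something the asker stated, so here is a sketch under that assumption using which.min on the absolute differences.

```r
# The Id/Time values from the question.
data2 <- data.frame(
  Id = 1:25,
  Time = c(5, 6.1, 7.2, 8.3, 9.6, 10.9, 13, 15.1, 17.2, 19.3, 21.4, 23.5,
           25.6, 27.1, 28.6, 30.1, 31.8, 33.5, 35.2, 36.9, 38.6, 40.3, 42,
           43.7, 45.4)
)

# The 10 hour marks we want an Id for.
targets <- seq(10, 40, by = 10)

# For each target time, find the row whose Time is closest to it.
nearest_row <- vapply(targets,
                      function(t) which.min(abs(data2$Time - t)),
                      integer(1))

result <- data.frame(Time = targets, Id = data2$Id[nearest_row])
print(result)
```

With these values the Ids come out as 5, 10, 16 and 22, matching the output asked for. The %% approach only returns rows where Time is an exact multiple of 10, and there are none in this data.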

from ffdf to regular dataframe

Is there a way to transform a ffdf into a normal dataframe?
Assuming that the thing is small enough to fit in the ram.
for example:
library(ff)
library(ffbase)
data(trees)
Girth <- ff(trees$Girth)
Height <- ff(trees$Height)
Volume <- ff(trees$Volume)
aktiv <- ff(as.factor(sample(0:1,31,replace=T)))
#Create data frame with some added parameters.
data <- ffdf(Girth=Girth,Height=Height,Volume=Volume,aktiv=aktiv)
rm(Girth,Height,Volume,trees,aktiv)
aktiv <- subset.ffdf(data, data$aktiv== "1" )
and then convert aktiv to a data frame and save it as RData
(Sadly, the person waiting for the output doesn't want to learn how to work with the ff package, so I have no choice.)
Thanks
Just use as.data.frame:
aktiv <- subset(as.data.frame(data), aktiv == 1)
Girth Height Volume aktiv
2 8.6 65 10.3 1
7 11.0 66 15.6 1
9 11.1 80 22.6 1
12 11.4 76 21.0 1
13 11.4 76 21.4 1
15 12.0 75 19.1 1
17 12.9 85 33.8 1
20 13.8 64 24.9 1
21 14.0 78 34.5 1
23 14.5 74 36.3 1
26 17.3 81 55.4 1
27 17.5 82 55.7 1
28 17.9 80 58.3 1
31 20.6 87 77.0 1
From here you can easily use save or write.csv, e.g.:
save(aktiv, file="aktiv.RData")

Resources