Merge data frames in a for loop - R

Input: 6 CSVs, each holding a different segment; row 14 of each file stores the segment name.
Expected output: a single CSV (the 6 files appended together) that also includes the segment.
library(stringr)
for (i in 1:6) {
  name <- paste0("Page url - Fri. 1 May 2015 - Tue. 19 May 2015 ", "(", i, ")", ".csv")
  CSVlines <- readLines(name)
  v1 <- str_extract_all(CSVlines[14], "\\w+")[[1]]
  d1 <- read.csv(name, skip = 22, header = TRUE)
  df1 <- cbind(d1, setNames(list(v1[2]), v1[1]))
}
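As written, df1 is overwritten on every pass, so only the sixth file survives. A minimal sketch of the accumulating version (the output file name is assumed, and it assumes the label in row 14, v1[1], is the same in all six files so the frames share column names): collect each annotated data frame in a list, bind them, and write one CSV.
library(stringr)
dfs <- vector("list", 6)
for (i in 1:6) {
  name <- paste0("Page url - Fri. 1 May 2015 - Tue. 19 May 2015 ", "(", i, ")", ".csv")
  v1 <- str_extract_all(readLines(name)[14], "\\w+")[[1]]
  d1 <- read.csv(name, skip = 22, header = TRUE)
  dfs[[i]] <- cbind(d1, setNames(list(v1[2]), v1[1]))   # add the segment column
}
combined <- do.call(rbind, dfs)                         # append the six frames
write.csv(combined, "combined.csv", row.names = FALSE)  # hypothetical output name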

Extract values from first few lines then skip for reading .txt file

Say I have two files, file1.txt and file2.txt, that look like this:
file1.txt
blablabla
lorem ipsum
year: 2007
Jan Feb Mar
1 2 3
4 5 6
file2.txt
blablabla
lorem ipsum
year: 2008
Jan Feb Mar
7 8 9
10 11 12
I can read these files with purrr::map_df(file_names, read_table, skip = 3).
But what I want to do is extract the year from each file and assign it on a new year column so that my final dataframe looks like this:
Jan Feb Mar Year
1 2 3 2007
4 5 6 2007
7 8 9 2008
10 11 12 2008
I am looking somewhere along the lines of using readr::read_lines first and then readr::read_table via rlang::exec, but I don't know exactly how to do this.
Base R supports streaming from connections with readLines:
f <- function(path) {
  ## Open connection and close on exit
  zzz <- file(path, open = "rt")
  on.exit(close(zzz))
  ## Read first three lines and extract the year from the third
  y <- as.integer(gsub("\\D", "", readLines(zzz, n = 3L)[3L]))
  ## Read remaining lines into a data frame
  d <- read.table(zzz, header = TRUE)
  d$Year <- y
  d
}
nms <- c("file1.txt", "file2.txt")
do.call(rbind, lapply(nms, f))
Jan Feb Mar Year
1 1 2 3 2007
2 4 5 6 2007
3 7 8 9 2008
4 10 11 12 2008
It's not clear to me that readr has this functionality:
library("readr")
zzz <- file("file1.txt", open = "rb")
read_lines(zzz, skip = 2L, n_max = 1L)
## [1] "year: 2007"
read_table(zzz)
## # A tibble: 0 × 0
close(zzz)
Even though we only asked read_lines for the third line of file1.txt, it seems to have (invisibly) read all of the lines, leaving nothing for read_table.
On the other hand, this GitHub issue was "fixed" last year, so it is strange not to see support for streaming connections in the latest release version of readr. Maybe I'm missing something...?
Another solution is to use the id argument of read_csv, which creates a new column containing the file name (e.g. "file1.txt") so you can tell which file each row came from.
Note that you don't need map_df: you can pass the vector of file names directly to read_csv() and it will read each file and compile the results into one data frame.
From there, create a second data frame holding just the year line (e.g. "year: 2007"), again with the id argument, this time using the skip and n_max arguments so that only that one line is read.
With those two data frames you can then left-join on the column set by the id argument to pull the year into every row.
You will still need to extract the numeric year from the text, which is easily done with stringr::str_extract().
file_paths <- c("file1.txt", "file2.txt")  # vector of input files
df_missing_year <- readr::read_csv(file = file_paths, id = "source", skip = 3)
df_year_only <- readr::read_csv(file = file_paths, id = "source", skip = 2, n_max = 1, col_names = "year")
df_year_only$year <- as.integer(stringr::str_extract(df_year_only$year, "\\d{4}"))
df_complete <- dplyr::left_join(x = df_missing_year, y = df_year_only, by = "source")

test data time series - how can I merge 2 data sets?

I have 2 datasets: one with test results (nedl_1) and another with more test results plus a time for each test (subset_cal1, column dc_time). I'd like to merge the data by the ID shown in the File column (e.g. 180061). I first transposed subset_cal1, so the column names in the two datasets are now nearly the same (except for a "."). However, joining them is not possible because one dataset is numeric and the other is factor (a side effect of the transpose).
I coerced the transposed subset_cal1 to numeric, but the dc_time column got coerced into a number as well. I think I'm forcing something here, and I'd rather learn how to do it right because it will come up again.
nedl_1
Wavelength 18005.1 18006.1 18009.1 18010.1 18012.1
1 350 7.920042e-10 8.118013e-10 1.002651e-09 7.379407e-10 9.285596e-10
2 351 7.990535e-10 6.535653e-10 1.275650e-09 5.742704e-10 9.042697e-10
subset_cal1
File dc_time Channels it calibration instrument_num
1 180061 Fri Jan 20 15:37:40 2012 2151 136 1 18006
2 180091 Fri Jan 27 13:30:23 2012 2151 136 1 18009
3 180101 Fri Jan 27 09:41:38 2012 2151 136 1 18010
4 180121 Tue Feb 28 12:15:02 2012 2151 136 1 18012
Here is the code that I used to transpose subset_cal1 and then join it with nedl_1:
n <- subset_cal1$File                          # remember the IDs in $File
sh_raw <- as.data.frame(t(subset_cal1[, -1]))  # transpose all but $File
colnames(sh_raw) <- n                          # set colnames to the IDs stored in n
dups <- unique(as.list(sh_raw))                # the unique columns
sh_raw_2 <- sh_raw[!duplicated(dups)]          # drop duplicate columns
j_raw_nedl <- left_join(sh_raw, nedl_1)        # join matching cols
Error: Can't join on '18051.1' x '18051.1' because of incompatible types (numeric / character)
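The transpose can be avoided entirely by reshaping nedl_1 instead. A sketch, assuming nedl_1 was read with check.names = FALSE (so the numeric column names survive) and the ID formats shown above ("18006.1" in nedl_1 vs. "180061" in subset_cal1): pivot nedl_1 to long form so the IDs become a character column, strip the ".", and join.
library(dplyr)
library(tidyr)
nedl_long <- nedl_1 %>%
  pivot_longer(-Wavelength, names_to = "File", values_to = "nedl") %>%
  mutate(File = gsub(".", "", File, fixed = TRUE))  # "18006.1" -> "180061"
subset_cal1 %>%
  mutate(File = as.character(File)) %>%             # avoid the factor/numeric clash
  left_join(nedl_long, by = "File")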

Importing data into R, when data isn't formatted as a table

I have the following tab-separated .txt file with 9796 lines:
https://www.dropbox.com/s/fnrbmaw8odm2rqs/Kommunale_N%C3%B8gletal.txt?dl=0
I would like to read the file into R; however, the file is not in a classic table format. Instead, each variable of interest has 279 rows and 16 columns, where the first row gives the variable name, the first 2 columns give a municipality name and code, and the following 14 columns give the years 1993-2006. Each variable is separated by a blank row. The file includes 35 variables.
I would like to read the data into a data.frame, with one column each for the municipality name, the municipality code, and the year, plus one column for each of the 35 variables.
In case you are not comfortable following links or prefer a smaller sample, the following illustrates the dataset (2 variables and 3 years of observations):
Indbyggertal 1 januar
Københavns Kommune 101 466129 467253 471300
Frederiksberg Kommune 147 87173 87466 88002
Ballerup Kommune 151 45427 45293 45356
Andel 0-17-årige
Københavns Kommune 101 14.0 14.1 14.4
Frederiksberg Kommune 147 12.4 12.5 12.6
Ballerup Kommune 151 21.2 21.1 21.3
The first 3 lines of the preferred output should look like this:
Municipality name Municipality code Year Indbyggertal 1 januar Andel 0-17-årige … Ældreudg (netto) pr 65+/67+-årig
Københavns Kommune 101 1993 466129 14 35350
Frederiksberg Kommune 147 1993 87173 12.4 33701
Ballerup Kommune 151 1993 45427 21.2 31126
There are probably more ways of doing this, but the trick I use below is to read all the data in as text, determine the positions where new blocks begin, and then loop over the blocks, reading each one in and storing it in a list:
lines <- readLines("Kommunale_Nøgletal.txt", encoding = "latin1")
# Find empty lines; these start a new block
start <- c(0, grep("^[\t]+$", lines))
# Read titles
headers <- lines[start + 1]
headers <- gsub("\t", "", headers)
# Determine beginning and end of each data block
begin <- start + 2
end <- c(start[-1] - 1, length(lines))
# Read each of the data blocks into a list
data <- vector(mode = "list", length(headers))
for (i in seq_along(headers)) {
  block <- lines[begin[i]:end[i]]
  data[[i]] <- read.table(textConnection(block), sep = "\t", na.strings = c("U", "M", "-"))
}
names(data) <- headers
Setting the correct headers in each of the datasets should be simple after this, and combining them into one data.frame can be done using bind_rows from the dplyr package. Below is an example:
# Set column names and add the variable name to each dataset
for (i in names(data)) {
  names(data[[i]]) <- c("municipality", "code", paste0("Y", 1993:2006))
  data[[i]]$var <- i
}
# Merge the different datasets into one data.frame
library(dplyr)
data <- bind_rows(data)
# Transpose the data
library(reshape2)
m <- melt(data, id.vars = c("municipality", "code", "var"))
res <- dcast(m, municipality + code + variable ~ var)
# Fix the year variable
names(res)[3] <- "year"
res$year <- as.numeric(gsub("Y", "", res$year))

Quickly create new columns in dataframe using lists - R

I have data containing quotes of indexes (S&P500, CAC40, ...) for every 5 minutes of the last 3 years, which makes it quite large. I am trying to create new columns containing the performance of each index at each time, i.e. (quote at [TIME] / quote at yesterday's close) - 1, for each index. I began this way (my data is named temp):
listIndexes <- list("CAC", "SP", "MIB")     # there are a lot more
listTime <- list(900, 905, 910, ..., 1735)  # every 5 minutes
for (j in 1:length(listTime)) {
  Time <- listTime[[j]]
  for (i in 1:length(listIndexes)) {
    Index <- listIndexes[[i]]
    temp[[paste0(Index, "perf", Time)]] <- temp[[paste0(Index, Time)]] / temp[[paste0(Index, "close")]] - 1
    # other stuff to do, but with the same concept
  }
}
but it is quite slow. Is there a way to get rid of the for loops, or to make the creation of those variables quicker? I read about the apply family of functions, but I don't see if and how they should be used here.
My data looks like this:
date CACcloseyesterday CAC1000 CAC1005 ... CACclose ... SP1000 ... SPclose
20140105 3999 4000 4001.2 4005 .... 2000 .... 2003
20140106 4005 4004 4003.5 4002 .... 2005 .... 2002
...
and my desired output would be new columns (more exactly, a new column for each time and each index) added to temp:
date CACperf1000 CACperf1005... SPperf1000...
20140106 (4004/4005)-1 (4003.5/4005)-1 .... (2005/2003)-1 # the close used is the one of the day before
the same for the following day
I wrote (4004/4005)-1 just to show the calculation, but the result should be a number: -0.0002496879
It looks like you want to generate every combination of Index and Time. Each Index-Time combination is a column in temp and you want to calculate a new perf column by comparing each Index-Time column against a specific Index close column. And your problem is that you think there should be an easier (less error-prone) way to do this.
We can remove one of the for-loops by generating all the necessary column names beforehand using something like expand.grid.
listIndexes <- list("CAC", "SP", "MIB")
listTime <- list(900, 905, 910, 915, 920)
df <- expand.grid(Index = listIndexes, Time = listTime,
                  stringsAsFactors = FALSE)
df$c1 <- paste0(df$Index, "perf", df$Time)
df$c2 <- paste0(df$Index, df$Time)
df$c3 <- paste0(df$Index, "close")
head(df)
#> Index Time c1 c2 c3
#> 1 CAC 900 CACperf900 CAC900 CACclose
#> 2 SP 900 SPperf900 SP900 SPclose
#> 3 MIB 900 MIBperf900 MIB900 MIBclose
#> 4 CAC 905 CACperf905 CAC905 CACclose
#> 5 SP 905 SPperf905 SP905 SPclose
#> 6 MIB 905 MIBperf905 MIB905 MIBclose
Then only one loop is required, and it's for iterating over each batch of column names and doing the calculation.
for (row_i in seq_len(nrow(df))) {
  this_row <- df[row_i, ]
  temp[[this_row$c1]] <- temp[[this_row$c2]] / temp[[this_row$c3]] - 1
}
An alternative solution would be to reshape your data into a form that makes this transformation much simpler: convert it into a long, tidy format with Date, Index, Time, Value and Close columns and operate directly on just the two relevant columns, as sketched below.
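A minimal sketch of that reshaping route, assuming the column naming pattern from the toy data above (a date column plus CAC1000, ..., CACclose, SP1000, ...; swap CACclose for CACcloseyesterday in the second pivot if yesterday's close is wanted):
library(dplyr)
library(tidyr)
# Gather the per-time quote columns into rows: one row per date/index/time
long <- temp %>%
  pivot_longer(cols = matches("^[A-Z]+\\d+$"),
               names_to = c("Index", "Time"),
               names_pattern = "([A-Z]+)(\\d+)",
               values_to = "Value")
# Gather the per-index close columns the same way
closes <- temp %>%
  pivot_longer(cols = ends_with("close"),
               names_to = "Index",
               names_pattern = "([A-Z]+)close",
               values_to = "Close") %>%
  select(date, Index, Close)
# One vectorised computation replaces the whole nested loop
perf <- long %>%
  left_join(closes, by = c("date", "Index")) %>%
  mutate(Perf = Value / Close - 1)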

Merge two dataframes with repeated columns

I have several .csv files, each one corresponding to a monthly list of customers and some information about them. Each file consists of the same information about customers such as:
names(data.jan)
ID AGE CITY GENDER
names(data.feb)
ID AGE CITY GENDER
To simplify, I will consider only two months, January and February, but my real set of csv files goes from January to November.
Considering a "customer X", I have three possible scenarios:
1- Customer X is listed in the January database, but he left and is not listed in February
2- Customer X is listed in both the January and February databases
3- Customer X entered the database in February, so he is not listed in January
I am stuck on the following problem: I need to create a single database with all customers that appear in either dataframe, together with their information. For a customer listed in both dataframes, I want to keep the information from his first entry, that is, January.
When I use merge, I have four options, according to http://www.dummies.com/how-to/content/how-to-use-the-merge-function-with-data-sets-in-r.html
data <- merge(data.jan,data.feb, by="ID", all=TRUE)
Regardless of whether I choose all, all.x or all.y, I get the same undesired output, called data:
data[1,]
ID AGE.x CITY.x GENDER.x AGE.y CITY.y GENDER.y
123 25 NY M 25 NY M
I think what would work here is to first merge both databases with a join that keeps only February's new customers, then merge the resulting dataframe with data.jan using a full outer join. But I don't know how to code this in R.
Thanks,
Bernardo
d1 <- data.frame(x=1:9,y=1:9,z=1:9)
d2 <- data.frame(x=1:10,y=11:20,z=21:30) # example data
d3 <- merge(d1,d2, by="x", all=TRUE) #merge
# keep the original columns from janary (i.e. y.x, z.x)
# but replace the NAs in those columns with the data from february (i.e. y.y,z.y )
d3[is.na(d3[, 2]), 2:3] <- d3[is.na(d3[, 2]), 4:5]
#> d3[, 1:3]
# x y.x z.x
#1 1 1 1
#2 2 2 2
#3 3 3 3
#4 4 4 4
#5 5 5 5
#6 6 6 6
#7 7 7 7
#8 8 8 8
#9 9 9 9
#10 10 20 30
This may be tiresome for more than 2 months though, so perhaps you should consider @flodel's comments. Also note there are pitfalls when your original January data has NAs (and you still want the first month's data, NA or not, retained), although you never mentioned those in your question.
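For what it's worth, a compact dplyr sketch of the same NA-filling merge on the toy d1/d2 above (it shares the caveat just mentioned: a genuine NA in the January columns gets overwritten by February):
library(dplyr)
d3 <- full_join(d1, d2, by = "x") %>%
  mutate(y = coalesce(y.x, y.y),   # prefer January's value, fall back to February
         z = coalesce(z.x, z.y)) %>%
  select(x, y, z)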
Try:
data <- merge(data.jan, data.frame(ID = data.feb$ID), by = "ID")
although I haven't tested it since there is no sample data. If you join on just the ID column from February, it should keep only the customers that appear in both frames.
@user1317221_G's solution is excellent. If your tables are large (lots of customers), data.table might be faster:
library(data.table)
# some sample data
jan <- data.table(id = 1:10, age = round(runif(10, 25, 55)),
                  city = c("NY", "LA", "BOS", "CHI", "DC"), gender = rep(c("M", "F"), each = 5))
new <- data.table(id = 11:16, age = round(runif(6, 25, 55)),
                  city = c("NY", "LA", "BOS", "CHI", "DC", "SF"), gender = c("M", "F"))
feb <- rbind(jan[6:10, ], new)
new <- data.table(id = 17:22, age = round(runif(6, 25, 55)),
                  city = c("NY", "LA", "BOS", "CHI", "DC", "SF"), gender = c("M", "F"))
mar <- rbind(jan[1:5, ], new)
setkey(jan, id)
setkey(feb, id)
join <- merge(jan, feb, by = "id", all = TRUE)
join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
Edit: This adds processing for multiple months.
f <- function(x, y) {
  setkey(x, id)
  setkey(y, id)
  join <- merge(x, y, by = "id", all = TRUE)
  join[is.na(age.x), names(join)[2:4] := join[is.na(age.x), 5:7, with = FALSE]]
  join[, names(join)[5:7] := NULL]                # get rid of extra columns
  setnames(join, 2:4, c("age", "city", "gender")) # rename columns that remain
  return(join)
}
Reduce(f, list(jan, feb, mar))
Reduce(...) applies the function f(...) to the elements of the list in turn: first to jan and feb, then to that result and mar, and so on; i.e. Reduce(f, list(jan, feb, mar)) is equivalent to f(f(jan, feb), mar).
