Reading a txt file and converting it to a data frame in R

I have a .txt file that contains some investment data. I want to convert the data in the file to a data frame with three columns. The data in the .txt file look like this:
Date:
06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15
Equity :
-237.79, -170.37, 304.32, 54.19, -130.5
Debt :
16318.49, 9543.76, 6421.67, 3590.47, 2386.3

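If you are going to use read.table(), then the following may help.
Assuming dat.txt contains the contents above:
# strip.white = TRUE removes the blank that follows each comma
dat <- read.table("dat.txt", fill = TRUE, sep = ",", strip.white = TRUE)
# Even rows hold the values; transpose them so each becomes a column
df <- as.data.frame(t(dat[seq(2, nrow(dat), by = 2), ]))
rownames(df) <- seq(nrow(df))
# Odd rows hold the column names; drop the colon and surrounding blanks
colnames(df) <- trimws(gsub(":", "", dat[seq(1, nrow(dat), by = 2), 1]))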
yielding:
> df
Date Equity Debt
1 06-04-15 -237.79 16318.49
2 07-04-15 -170.37 9543.76
3 08-04-15 304.32 6421.67
4 09-04-15 54.19 3590.47
5 10-04-15 -130.5 2386.3

Assuming the text file is named demo.txt, here is one way to do this:
# Read the file line by line
all_vals <- readLines("demo.txt")
# Column names and data are on alternating lines,
# so first gather the column names and clean them
column_names <- trimws(sub(":", "", all_vals[c(TRUE, FALSE)]))
# Then paste the data lines together, read them, and assign the column names
df <- setNames(data.frame(t(read.table(text = paste0(all_vals[c(FALSE, TRUE)],
                                                     collapse = "\n"),
                                       sep = ",", strip.white = TRUE)),
                          row.names = NULL), column_names)
# After transposing, every column is character (factor before R 4.0),
# so use type.convert to convert each column to its natural type
type.convert(df, as.is = TRUE)
# Date Equity Debt
#1 06-04-15 -237.79 16318.49
#2 07-04-15 -170.37 9543.76
#3 08-04-15 304.32 6421.67
#4 09-04-15 54.19 3590.47
#5 10-04-15 -130.50 2386.30
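If the Date column should end up as a real Date rather than text, the alternating-line idea can be extended a little. The following is only a sketch: `lines_in` is a hypothetical stand-in for `readLines("demo.txt")`, `inv` is an illustrative name, and the day-month-year reading of the dates is an assumption about the data.

```r
# Stand-in for readLines("demo.txt"); the variable name is illustrative only
lines_in <- c("Date:",
              "06-04-15, 07-04-15, 08-04-15, 09-04-15, 10-04-15",
              "Equity :",
              "-237.79, -170.37, 304.32, 54.19, -130.5",
              "Debt :",
              "16318.49, 9543.76, 6421.67, 3590.47, 2386.3")

# Odd lines hold the column names, even lines hold the values
col_names <- trimws(sub(":", "", lines_in[c(TRUE, FALSE)], fixed = TRUE))
values    <- strsplit(lines_in[c(FALSE, TRUE)], ",", fixed = TRUE)

# One list element per column; trim the blanks left after each comma
inv <- setNames(lapply(values, trimws), col_names)
inv <- as.data.frame(inv, stringsAsFactors = FALSE)

# Assumption: dates are day-month-year with a two-digit year
inv$Date   <- as.Date(inv$Date, format = "%d-%m-%y")
inv$Equity <- as.numeric(inv$Equity)
inv$Debt   <- as.numeric(inv$Debt)

str(inv)
```

Converting each column explicitly avoids relying on type guessing, at the cost of having to know the column names in advance.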

Related

Create unique Id in R by combining 2 columns

I am reading in 2 big .TXT files and filtering them based off a certain code. The codes are located in the 16th column of each file.
colleges <- read.table("Colleges.txt", sep = "|", header = TRUE, fill = TRUE)
majors <- read.table("Majors.txt", sep = "|", header = TRUE, fill = TRUE)
The Data looks like this
bld_name  dpt_name  majors      admin  code  college  year
MLK       English   Literature  Ms. W  T     A&S      18
Freedom   Math      Stats       Ms. B  R     STEM     18
MLK       Math      CALC        Ms. B  P     STEM     18
After I create the subsets and append the two files, I want to create a unique ID from bld_name and dpt_name.
college_sub <- subset(colleges, colleges[[16]] %in% c("T", "R"))
majors_sub <- subset(majors, majors[[16]] %in% c("T", "R"))
combine <- rbind(college_sub, majors_sub) # Append both files
combine$id <- paste(combine$bld_name, combine$dpt_name, sep = "-")
cols_g <- c("id", "majors", "admin", "code", "college", "year")
combine <- combine[, cols_g]
It should look like this:
Unique ID    majors      admin  code  college  year
MLK-English  Literature  Ms. W  T     A&S      18
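One possible way to get there, sketched on toy data that mirrors the sample rows above. The data frame below is typed in by hand for illustration (the real one would come from read.table()), and filtering on a column named code is an assumption in place of the 16th-column lookup.

```r
# Toy data mirroring the sample rows; real data would come from read.table()
combine <- data.frame(
  bld_name = c("MLK", "Freedom", "MLK"),
  dpt_name = c("English", "Math", "Math"),
  majors   = c("Literature", "Stats", "CALC"),
  admin    = c("Ms. W", "Ms. B", "Ms. B"),
  code     = c("T", "R", "P"),
  college  = c("A&S", "STEM", "STEM"),
  year     = c(18, 18, 18),
  stringsAsFactors = FALSE
)

# Keep only the codes of interest
combine <- combine[combine$code %in% c("T", "R"), ]

# sep = "-" joins the two parts directly, without surrounding spaces
combine$id <- paste(combine$bld_name, combine$dpt_name, sep = "-")

# Put the ID first and drop the two columns it was built from
combine <- combine[, c("id", "majors", "admin", "code", "college", "year")]
combine
```

Passing "-" as sep matters: given as an ordinary argument, paste() treats it as a third string and inserts its default space separator on both sides.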

Reformatting downloaded Excel data

I have downloaded some GDP data in .xls format from the OECD website. However, to make this data workable in R, I need to reformat it as a .csv file. More specifically, I need the year, month and day in the first column, followed by a comma and the GDP value (for example: 1990-01-01, 234590).
The column with GDP values can be easily copied and transposed, but how does one quickly add dates? Is there a fast way to do this, without having to add in the dates manually?
Thanks for the help!
Best,
Sean
PS. Link to (one of) the specific OECD files: https://ufile.io/8ogav or https://stats.oecd.org/index.aspx?queryid=350#
PPS. I have now changed the file to this:
Which I would like to transform into the same style as example 1.
Codes that I use for reading in data:
gdp.start <- c(1970,1) # type "double"
gdp.end <- c(2018,1)
gdp.raw <- "rawData/germany_gdp.csv"
gdp.table <- read.table(gdp.raw, skip = 1, header = FALSE, sep = ',', stringsAsFactors = FALSE)
gdp.ger <- ts(gdp.table[,2], start = gdp.start, frequency = 4) # time-series representation
PPPS.
dput(head(gdp.table))
structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
Using your data:
z <- structure(list(V1 = c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ",
"Q4-1970;1.478.108 ", "Q1-1971;1.449.712 ", "Q2-1971;1.480.136 ",
"Q3-1971;1.505.743 ")), row.names = c(NA, 6L), class = "data.frame")
dat <- read.csv2(text=paste(z$V1, collapse='\n'), stringsAsFactors=FALSE, header=FALSE)
dat
#        V1         V2
# 1 Q2-1970 1.438.810
# 2 Q3-1970 1.465.684
# 3 Q4-1970 1.478.108
# 4 Q1-1971 1.449.712
# 5 Q2-1971 1.480.136
# 6 Q3-1971 1.505.743
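and a simple function to replace quarters with the first date of each quarter:
# Note: this masks base::quarters() for the rest of the session
quarters <- function(s, format) {
  qs <- c("Q1", "Q2", "Q3", "Q4")
  dts <- c("01-01", "04-01", "07-01", "10-01")
  # Swap each quarter label for the month-day that starts that quarter
  for (i in seq_along(qs))
    s <- sub(qs[i], dts[i], s)
  # Only convert to Date when a format is supplied
  if (!missing(format))
    s <- as.Date(s, format = format)
  s
}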
We can change them into strings of dates, preserving the order:
str(quarters(dat$V1))
# chr [1:6] "04-01-1970" "07-01-1970" "10-01-1970" "01-01-1971" ...
or we can convert into Date objects by setting the format:
str( quarters(dat$V1, format='%m-%d-%Y') )
# Date[1:6], format: "1970-04-01" "1970-07-01" "1970-10-01" "1971-01-01" ...
so replacing the column with the actual Date object is simply dat$V1 <- quarters(dat$V1, format='%m-%d-%Y').
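The GDP column is still text at this point. As a rough sketch of the remaining step, the following turns both parts into a date and a number and writes them out as CSV. It assumes the dots in values such as 1.438.810 are thousands separators (not decimal points); the names `raw` and `gdp_df` are illustrative only.

```r
# Raw strings as in the dput() output above
raw <- c("Q2-1970;1.438.810 ", "Q3-1970;1.465.684 ", "Q4-1970;1.478.108 ")

# Split each record into its quarter label and its GDP string
parts <- strsplit(trimws(raw), ";", fixed = TRUE)
qtr   <- vapply(parts, `[`, character(1), 1)
gdp   <- vapply(parts, `[`, character(1), 2)

# Assumption: the dots are thousands separators, so they are simply removed
gdp_num <- as.numeric(gsub(".", "", gdp, fixed = TRUE))

# "Q2-1970" -> first month of the quarter (Q1 = 1, Q2 = 4, Q3 = 7, Q4 = 10)
q_num <- as.integer(substr(qtr, 2, 2))
year  <- substr(qtr, 4, 7)
date  <- as.Date(sprintf("%s-%02d-01", year, (q_num - 1L) * 3L + 1L))

gdp_df <- data.frame(date = date, gdp = gdp_num)

# stdout() prints the CSV to the console; replace it with a file name to save
write.csv(gdp_df, stdout(), row.names = FALSE, quote = FALSE)
```

Computing the month arithmetically avoids the string substitution loop, but it only works while the labels keep the exact Qn-YYYY layout.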

How do I average a sentiment score for a day with multiple texts?

I am doing a text sentiment analysis in R using the tm package. I have scraped news articles from Reuters and gave them a variable name according to their date. I added a,b,c etc. to indicate multiple articles per day, like this:
art170411a
art170411b
art170411c
art170410a
...
...
I then run a standard positive/negative terms analysis which gives me the sentiment score per article. My question is: how do I average these scores so that I get a sentiment score per day?
I have a VCorpus containing my 2000+ articles over 3 years. Every article has a date stamp. For the matching with the positive/negative terms I have converted my Corpus to a list and then a bag of words like this:
corp_list <- lapply(corp, FUN = paste, collapse=" ")
corp_bag <- str_split(corp_list, pattern = "\\s+")
I have the final score in two formats:
score_naive_list <- lapply(corp_bag, function(x) { sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))})
score_naive <- unlist(lapply(corp_bag, function(x) { sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))}))
So my question: how do I average the multiple sentiment scores into a one day score?
I redid my answer with reproducible data. Once you get your data sorted, this should work just fine.
library(tm)
library(stringr) # for str_split()
reut21578 <- system.file("texts", "crude", package = "tm")
corp <- VCorpus(DirSource(reut21578), readerControl = list(reader = readReut21578XMLasPlain))
# One time stamp per document, reduced to the calendar day
timestamps <- meta(corp, "datetimestamp")
days <- sapply(timestamps, strftime, format = "%Y-%m-%d")
pos <- c("good", "excellent", "positive", "effective")
neg <- c("bad", "terrible", "negative")
corp_list <- lapply(corp, FUN = paste, collapse = " ")
# Paste all texts from the same day into one string per day
daily_bows <- aggregate(corp_list ~ days, data.frame(corp_list = unlist(corp_list), days = days), FUN = paste, collapse = " ")
corp_bag <- str_split(daily_bows$corp_list, pattern = "\\s+")
# Score = number of positive words minus number of negative words
score_string <- function(x) {
  sum(!is.na(match(x, pos))) - sum(!is.na(match(x, neg)))
}
daily_bows$scores <- sapply(corp_bag, score_string)
print(daily_bows[, c("days", "scores")])
#         days scores
# 1 1987-02-26      3
# 2 1987-03-01      1
# 3 1987-03-02      1
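Note that the code above scores the combined text of each day rather than averaging per-article scores. If a true average is wanted, one option is to keep the per-article scores and take the mean by day. The sketch below assumes the scores are held in a vector named after the articles, following the artYYMMDDx scheme from the question; the score values themselves are made up for illustration.

```r
# Hypothetical per-article scores, named with the question's artYYMMDDx scheme
score_naive <- c(art170411a = 3, art170411b = -1, art170411c = 1, art170410a = 2)

# Characters 4-9 of each name hold the date as yymmdd
day <- as.Date(substr(names(score_naive), 4, 9), format = "%y%m%d")

# Mean score per calendar day
daily <- aggregate(score ~ day,
                   data = data.frame(day = day, score = score_naive),
                   FUN = mean)
daily
```

With a corpus that carries time stamps in its metadata, the `day` vector could instead be built from `meta(corp, "datetimestamp")` as in the answer above.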

Reshaping a dataframe in R

I need some help to re-design the output of a function that comes through an R package.
My goal is to reshape a data frame called output_IMFData so that it looks very similar to output_imfr. The code for a MWE that reproduces these data frames is:
library(imfr)
output_imfr <- imf_data(database_id = "IFS", indicator = "IAD_BP6_USD", country = "", start = 2010, end = 2014, freq = "A", return_raw = FALSE, print_url = TRUE, times = 3)
and for output_IMFData
library(IMFData)
databaseID <- "IFS"
startdate <- "2010"
enddate <- "2014"
checkquery <- FALSE
queryfilter <- list(CL_FREA = "A", CL_AREA_IFS = "", CL_INDICATOR_IFS = "IAD_BP6_USD")
output_IMFData <- CompactDataMethod(databaseID, queryfilter, startdate, enddate,
checkquery)
the output from output_IMFData looks like this:
But, I want to redesign this dataframe to look like the output of output_imfr:
Sadly, I am not an advanced user and could not find anything that helps. My basic problem in converting output_IMFData to the second, panel-data-like shape is that I don't know how to handle the Obs column without losing its correspondence with the reference code in #REF_AREA. That is, the #REF_AREA column holds country codes, and the Obs column holds their respective time series data. This is a very cumbersome way of working with panel data, so I want to reshape that data frame into the much nicer form of output_imfr.
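The data of interest are stored in a list in the column Obs. Here is a dplyr solution that splits the data by row, cracks open the list, then stitches things back together.
library(dplyr) # provides %>% and bind_rows()
longData <-
  output_IMFData %>%
  split(1:nrow(.)) %>%
  lapply(function(x) {
    # x$Obs is that row's data frame of observations; the country code is recycled
    data.frame(
      iso2c = x[["#REF_AREA"]]
      , x$Obs
    )
  }) %>%
  bind_rows()
head(longData)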
head(longData)
gives:
  iso2c X.TIME_PERIOD      X.OBS_VALUE X.OBS_STATUS
1    FJ          2010 47.2107721901621         <NA>
2    FJ          2011         48.28347         <NA>
3    FJ          2012 51.0823499999999         <NA>
4    FJ          2013 157.015648875072         <NA>
5    FJ          2014 186.623232882226         <NA>
6    AW          2010 616.664804469274         <NA>
Here's another approach:
NewDataFrame <- data.frame(iso2c = character(),
                           year = numeric(),
                           IAD_BP6_USD = character(),
                           stringsAsFactors = FALSE)
newrow = 1
for (i in 1:nrow(output_IMFData)) { # for each row of your cludgy df
  for (j in 1:length(output_IMFData$Obs[[i]]$`#TIME_PERIOD`)) { # for each year
    NewDataFrame[newrow, 'iso2c'] <- output_IMFData[i, '#REF_AREA']
    NewDataFrame[newrow, 'year'] <- output_IMFData$Obs[[i]]$`#TIME_PERIOD`[j]
    NewDataFrame[newrow, 'IAD_BP6_USD'] <- output_IMFData$Obs[[i]]$`#OBS_VALUE`[j]
    newrow <- newrow + 1 # increment down a row
  }
}
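The same unpacking can also be sketched in base R without growing a data frame row by row. The example below does not call the IMF packages: `toy` is a hand-made stand-in for output_IMFData with simplified, hypothetical column names (ref_area, time, value) and made-up values, so the names would need adapting to the real columns.

```r
# Toy stand-in for output_IMFData: one row per country, Obs is a list column
toy <- data.frame(ref_area = c("FJ", "AW"), stringsAsFactors = FALSE)
toy$Obs <- list(
  data.frame(time = c("2010", "2011"), value = c("1", "2"),
             stringsAsFactors = FALSE),
  data.frame(time = c("2010", "2011"), value = c("3", "4"),
             stringsAsFactors = FALSE)
)

# Build one small data frame per country; the country code is recycled
# across that country's observations
pieces <- Map(function(area, obs) {
  data.frame(iso2c = area,
             year  = as.integer(obs$time),
             value = as.numeric(obs$value),
             stringsAsFactors = FALSE)
}, toy$ref_area, toy$Obs)

# Stack the pieces into one long data frame
long <- do.call(rbind, pieces)
rownames(long) <- NULL
long
```

Building the pieces first and binding once tends to be faster than assigning into a data frame one row at a time, which copies the data frame on every iteration.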

Control text justification

I am trying to create a space-delimited input file for another program. I'm pasting together the contents of multiple columns and running into problems when the numbers have different lengths, due to what appears to be a default right-justification in R. For example:
row_id monthly_spend
123 4.55
567 24.64
678 123.09
becomes:
row_id:123 monthly_spend:  4.55
row_id:567 monthly_spend: 24.64
row_id:678 monthly_spend:123.09
while what I need is this:
row_id:123 monthly_spend:4.55
row_id:567 monthly_spend:24.64
row_id:678 monthly_spend:123.09
The code I'm using is derived from this question here and looks like this:
paste(row_id, monthly_spend, sep=":", collapse=" ")
I've tried formatting the columns as numeric or integer without any change.
Any suggestions?
If you put your vectors into a data.frame (if they are not already in one), you can use:
apply(sapply(names(myDF), function(x)
  paste(x, myDF[, x], sep = ":")), 1, paste, collapse = " ")
# [1] "row_id:123 monthly_spend:4.55"
# [2] "row_id:567 monthly_spend:24.64"
# [3] "row_id:678 monthly_spend:123.09"
or alternatively:
do.call(paste, lapply(names(myDF), function(x) paste0(x, ":", myDF[, x])))
sprintf is also an option; there are many ways of going about it.
Sample data used:
myDF <- read.table(header=TRUE, text=
"row_id monthly_spend
123 4.55
567 24.64
678 123.09")
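Since sprintf is mentioned as an option, here is a minimal sketch of that route. It assumes row_id holds whole numbers and that two decimal places are wanted for monthly_spend; `spend` and `out` are illustrative names.

```r
spend <- data.frame(row_id = c(123, 567, 678),
                    monthly_spend = c(4.55, 24.64, 123.09))

# %d and %.2f format each value on its own, so no padding is added
out <- sprintf("row_id:%d monthly_spend:%.2f",
               as.integer(spend$row_id), spend$monthly_spend)

# One record per line; pass a file name as the second argument to save
writeLines(out)
```

Because the format string is fixed, this suits a known set of columns; the names()-based versions above generalise better to arbitrary data frames.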
With your data snippet:
df <- read.table(text = "row_id monthly_spend
123 4.55
567 24.64
678 123.09", header = TRUE)
Then we can paste the pieces together, using the format function with trim = TRUE to strip the padding spaces you don't want:
with(df, paste("row_id:", row_id,
"monthly_spend:", format(monthly_spend, trim = TRUE)))
Which gives:
> with(df, paste("row_id:", row_id,
+ "monthly_spend:", format(monthly_spend, trim = TRUE)))
[1] "row_id: 123 monthly_spend: 4.55" "row_id: 567 monthly_spend: 24.64"
[3] "row_id: 678 monthly_spend: 123.09"
If you need this in a data frame before writing out to file, use:
newdf <- with(df, data.frame(foo = paste("row_id:", row_id,
"monthly_spend:",
format(monthly_spend, trim = TRUE))))
newdf
> newdf
foo
1 row_id: 123 monthly_spend: 4.55
2 row_id: 567 monthly_spend: 24.64
3 row_id: 678 monthly_spend: 123.09
When you write this out, the columns will be justified as you want.
Here is a general answer (any number of variables), assuming your data is in a data.frame dat:
x <- mapply(names(dat), dat, FUN = paste, sep = ":")
write.table(x, file = stdout(),
quote = FALSE, row.names = FALSE, col.names = FALSE)
And you can replace stdout() with a filename.
Assuming the data frame is called df:
write.table(as.data.frame(sapply(1:ncol(df), FUN = function(x) paste(rep(colnames(df)[x], nrow(df)), df[, x], sep = ":"))), "someFileName", quote = FALSE, row.names = FALSE, col.names = FALSE, sep = " ")
This is equivalent to the following substeps:
# generate the colon-separated records, one column at a time
df_cp <- sapply(1:ncol(df), FUN = function(x) paste(rep(colnames(df)[x], nrow(df)), df[, x], sep = ":"))
# cast to a data frame
df_cp <- as.data.frame(df_cp)
# write out to disk; quote = FALSE keeps quotation marks out of the file
write.table(df_cp, "someFileName", quote = FALSE, row.names = FALSE, col.names = FALSE, sep = " ")
