Say I have two files, file1.txt and file2.txt, that look like this:
file1.txt
blablabla
lorem ipsum
year: 2007
Jan Feb Mar
1 2 3
4 5 6
file2.txt
blablabla
lorem ipsum
year: 2008
Jan Feb Mar
7 8 9
10 11 12
I can read these files with purrr::map_df(read_table, skip = 3).
But what I want to do is extract the year from each file and assign it to a new Year column, so that my final data frame looks like this:
Jan Feb Mar Year
1 2 3 2007
4 5 6 2007
7 8 9 2008
10 11 12 2008
I am thinking along the lines of using readr::read_lines first and then readr::read_table via rlang::exec, but I don't know exactly how to do this.
Base R implements streaming connections with readLines:
f <- function(path) {
## Open connection and close on exit
zzz <- file(path, open = "rt")
on.exit(close(zzz))
## Read first three lines into character vector and extract year
y <- as.integer(gsub("\\D", "", readLines(zzz, n = 3L)[3L]))
## Read remaining lines into data frame
d <- read.table(zzz, header = TRUE)
d$Year <- y
d
}
nms <- c("file1.txt", "file2.txt")
do.call(rbind, lapply(nms, f))
Jan Feb Mar Year
1 1 2 3 2007
2 4 5 6 2007
3 7 8 9 2008
4 10 11 12 2008
It's not clear to me that readr has this functionality:
library("readr")
zzz <- file("file1.txt", open = "rb")
read_lines(zzz, skip = 2L, n_max = 1L)
## [1] "year: 2007"
read_table(zzz)
## # A tibble: 0 × 0
close(zzz)
Even though we only asked read_lines for the third line of file1.txt, it seems to have (invisibly) read all of the lines, leaving nothing for read_table.
On the other hand, this GitHub issue was "fixed" last year, so it is strange not to see support for streaming connections in the latest release version of readr. Maybe I'm missing something...?
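If streaming from a single connection is not essential, one possible workaround is to read each file twice by path: once with read_lines for the year line and once with read_table for the data. The sketch below is a minimal version of that idea, not a statement about how readr treats connections. The helper name read_with_year and the temporary files are made up for illustration, and it assumes the year always sits on the third line.

```r
library(readr)

# Recreate the two sample files from the question in a temporary directory
dir <- tempfile("years")
dir.create(dir)
paths <- file.path(dir, c("file1.txt", "file2.txt"))
writeLines(c("blablabla", "lorem ipsum", "year: 2007",
             "Jan Feb Mar", "1 2 3", "4 5 6"), paths[1])
writeLines(c("blablabla", "lorem ipsum", "year: 2008",
             "Jan Feb Mar", "7 8 9", "10 11 12"), paths[2])

# Hypothetical helper: read one file by path, twice
read_with_year <- function(path) {
  # Third line only, e.g. "year: 2007"
  year_line <- read_lines(path, skip = 2, n_max = 1)
  # Everything after the three preamble lines
  d <- read_table(path, skip = 3)
  # parse_number() drops the non-numeric text around the year
  d$Year <- parse_number(year_line)
  d
}

result <- do.call(rbind, lapply(paths, read_with_year))
result
```

Reading by path costs a second pass over each file, which should not matter for small files like these.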
Another solution is to use the id argument of read_csv, which in your example creates a new column holding the file name (e.g. "file1.txt") to show which file each row came from.
Note that you don't need map_df: you can pass the vector of file names directly to read_csv(), and it will read each file and combine the results into one data frame.
From there, you can create a second data frame by calling read_csv on just the third line (e.g. "year: 2007"), again with the id argument, this time using the skip and n_max arguments so that you only pull in the line with the year.
With those two data frames, you can then left join on the column you set with the id argument to pull the year into every row.
You will need to extract the year from the "year: 2007" text, which is easily done with the str_extract() function.
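# file_path is a character vector of file paths, e.g. c("file1.txt", "file2.txt")
df_missing_year <- readr::read_csv(file = file_path, id = "source", skip = 3)
# col_names = FALSE keeps the "year: 2007" line as data (column X1) rather than as a header
df_year_only <- readr::read_csv(file = file_path, id = "source", skip = 2, n_max = 1, col_names = FALSE)
df_complete <- dplyr::left_join(x = df_missing_year, y = df_year_only, by = "source")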
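The year-extraction step mentioned above could look roughly like this. It is only a sketch: year_text stands in for whatever column ends up holding the "year: 2007" text, and the pattern assumes a four-digit year.

```r
library(stringr)

# Stand-in for the column that holds the raw year line of each file
year_text <- c("year: 2007", "year: 2008")

# Pull out the first run of four digits and convert it to an integer
year <- as.integer(str_extract(year_text, "\\d{4}"))
year
```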
I have a dataset full of IDs and qualification strings. My issue with this is twofold:
how to deal with splits between different symbols, and
how to iterate the output down a data frame whilst retaining an ID.
ID <- c(1,2,3)
Qualstring <- c("LE:Science = 45 Distinctions",
"A:Chemistry = A A:Biology = A A:Mathematics = A",
"A:Biology = A A:Chemistry = A A:Mathematics = A B:Baccalaureate Advanced Diploma = Pass"
)
s <- data.frame(ID, Qualstring)
The desired output would be:
ID Qualification Subject Grade
1 1 LE: Science 45 Distinctions
2 2 A: Chemistry A
3 2 A: Biology A
4 2 A: Mathematics A
5 3 A: Biology A
6 3 A: Chemistry A
7 3 A: Mathematics A
8 3 WB: Welsh Baccalaureate Advanced Diploma Pass
The commonality of the splits is the ":" and "=", and the codes/words around those.
From my perspective the problem looks complex, and I wonder whether continuing to fudge it in Excel is ultimately the way to go for data structured like this. I would love to hear any recommendations or direction.
Here is a solution using data.table and stringr. I use data.table only for personal convenience; you could use data.frame with do.call(rbind, .) instead of rbindlist().
library(stringr)
qual <- str_extract_all(s$Qualstring,"[A-Z]+(?=\\:)")
subject <- str_extract_all(s$Qualstring,"(?<=\\:)[\\w ]+")
grade <- str_extract_all(s$Qualstring,"(?<=\\= )[A-z0-9]+")
library(data.table)
df <- lapply(seq(s$ID),function(i){
N = length(qual[[i]])
data.table(ID = rep(s[i,"ID"],N),
Qualification = qual[[i]],
Subject = subject[[i]],
Grade = grade[[i]]
)
}) %>% rbindlist()
ID Qualification Subject Grade
1: 1 LE Science 45
2: 2 A Chemistry A
3: 2 A Biology A
4: 2 A Mathematics A
5: 3 A Biology A
6: 3 A Chemistry A
7: 3 A Mathematics A
8: 3 B Baccalaureate Advanced Diploma Pass
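In short, I use a positive lookbehind (?<=) and a positive lookahead (?=). [A-Z]+ matches a run of uppercase letters, [\\w ]+ matches a run of word characters and spaces, and [A-z0-9]+ matches letters (upper and lower case) and digits. Note that the range A-z also covers the few punctuation characters that sit between Z and a in ASCII, so [A-Za-z0-9]+ is the stricter choice. str_extract_all returns a list holding all the matches for each element of the character vector.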
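To see the lookarounds in isolation, here is a small base R sketch on a single made-up string. It uses regmatches() and gregexpr() with perl = TRUE instead of stringr, and the stricter [A-Za-z0-9] class for the grade; the example string is an assumption, not taken from the data above.

```r
# One made-up qualification string
x <- "A:Chemistry = A A:Biology = B"

# Uppercase run immediately followed by ":" (lookahead; the colon is not consumed)
qual <- regmatches(x, gregexpr("[A-Z]+(?=:)", x, perl = TRUE))[[1]]

# Word characters and spaces immediately after ":" (lookbehind), trailing space trimmed
subject <- trimws(regmatches(x, gregexpr("(?<=:)[\\w ]+", x, perl = TRUE))[[1]])

# Letters or digits immediately after "= " (lookbehind)
grade <- regmatches(x, gregexpr("(?<== )[A-Za-z0-9]+", x, perl = TRUE))[[1]]

data.frame(Qualification = qual, Subject = subject, Grade = grade)
```

Because the grade pattern stops at the first space, a grade such as "45 Distinctions" would still be cut down to "45", just as in the output above.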
I am very new to NLP. Please, don't judge me strictly.
I have a very big data frame of customer feedback, and my goal is to analyze it. I tokenized the words in each piece of feedback and removed stop words (SMART). Now I need a table of the most and least frequently used words.
The code looks like this:
library(tokenizers)
library(stopwords)
words_as_tokens <-
tokenize_words(dat$description,
stopwords = stopwords(language = "en", source = "smart"))
The data frame looks like this: there are many pieces of feedback (variable "description") and the customers who gave them (customers are not unique; the same customer can appear more than once). I want a table with 3 columns: a) customer name, b) word, c) its frequency. This "ranking" should be in decreasing order.
Try this
library(tokenizers)
library(stopwords)
library(tidyverse)
# count freq of words
words_as_tokens <- setNames(lapply(sapply(dat$description,
tokenize_words,
stopwords = stopwords(language = "en", source = "smart")),
function(x) as.data.frame(sort(table(x), TRUE), stringsAsFactors = F)), dat$name)
# tidyverse's job
df <- words_as_tokens %>%
bind_rows(.id = "name") %>%
rename(word = x)
# output
df
# name word Freq
# 1 John experience 2
# 2 John word 2
# 3 John absolutely 1
# 4 John action 1
# 5 John amazon 1
# 6 John amazon.ae 1
# 7 John answering 1
# ....
# 42 Alex break 2
# 43 Alex nice 2
# 44 Alex times 2
# 45 Alex 8 1
# 46 Alex accent 1
# 47 Alex africa 1
# 48 Alex agents 1
# ....
Data
dat <- data.frame(name = c("John", "Alex"),
description = c("Unprecedented. The perfect word to describe Amazon. In every positive sense of that word! All because of one man - Jeff Bezos. What an entrepreneur! What a vision! This is from personal experience. Let me explain. I had given up all hope, after a horrible experience with Amazon.ae (formerly Souq.com) - due to a Herculean effort to get an order cancelled and the subsequent refund issued. I have never faced such a feedback-resistant team in my life! They were robotically answering my calls and sending me monotonous, unhelpful emails, followed by absolutely zero action!",
"Not only does Amazon have great products but their Customer Service for the most part is wonderful. Although most times you are outsourced to a different country, I personally have found that when I call it's either South Africa or Philippines and they speak so well, understand me and my NY accent and are quite nice. Let’s face it. Most times you are calling CS with a problem or issue. These agents have to listen to 8 hours of complaints so they themselves need a break. No matter how annoyed I am I try to be on my best behavior and as nice as can be because they too need a break with how nasty we as a society can be."), stringsAsFactors = F)
You can try with quanteda as well as follows:
library(quanteda)
library(quanteda.textstats)
# define a corpus object to store your initial documents
mycorpus = corpus(dat$description)
# tokenize first: recent quanteda versions no longer build a dfm straight from a corpus
mytokens = tokens( mycorpus,
remove_punct = TRUE, # this removes punctuation
remove_numbers = TRUE, # this removes digits
remove_symbols = TRUE, # this removes symbols
remove_url = TRUE ) # this removes urls
mytokens = tokens_remove( mytokens, stopwords("en") ) # this removes English stopwords
# convert the tokens to a Document-Feature Matrix
mydfm = dfm( mytokens, tolower = TRUE )
# calculate word frequencies and return a data.frame
word_frequencies = textstat_frequency( mydfm )
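Since the question notes that the same customer can appear more than once, here is a minimal base R sketch that pools all of a customer's feedback before counting. It is an illustration only: the feedback data frame and the tiny stop_words vector are made up, and the tokenizer is a crude split on non-letters rather than tokenize_words with the SMART list.

```r
# Toy data: "John" appears twice on purpose
feedback <- data.frame(
  name = c("John", "Alex", "John"),
  description = c("Great service, great price", "Slow delivery", "Great support"),
  stringsAsFactors = FALSE
)
stop_words <- c("the", "a", "and")  # placeholder stop-word list

# Lowercase and split on anything that is not a letter
tokens <- strsplit(tolower(feedback$description), "[^a-z]+")

# One row per (customer, word) occurrence
long <- data.frame(
  name = rep(feedback$name, lengths(tokens)),
  word = unlist(tokens),
  stringsAsFactors = FALSE
)
long <- long[nzchar(long$word) & !long$word %in% stop_words, ]

# Count occurrences per customer and word, then rank within each customer
freq <- aggregate(n ~ name + word, data = transform(long, n = 1L), FUN = sum)
freq <- freq[order(freq$name, -freq$n), ]
freq
```

Sorting by -freq$n alone instead would give one overall ranking rather than a ranking per customer.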
I'm new to R so please forgive the repetitive question. I was trying to do this in Access (I know) but unfortunately the application kept crashing.
I have a data frame that contains 78k records imported from a CSV. It should form a tree-like structure, although there may not be a natural root because this is a subset of the entire org.
POS_NUM|TITLE|REPORT_TO_POS_NUM
1234 Bob 789
5698 Jim 1234
8976 Frank 1653
This should form a loose tree relationship:
Bob
\ Jim
Frank
Essentially I need this to calculate the number of sub-reports for each person, the number of direct reports, and some other recursive functions.
EDIT
Right now I'm attempting to simply loop through my table
treeDataOne <- read.csv(file="File1.csv", header=TRUE, stringsAsFactors=FALSE, sep=",")
treeDataTwo <- read.csv(file="File2.csv",header=TRUE, stringsAsFactors=FALSE, sep=",") #Same columns, different data
treeDataAll <- rbind(treeDataOne, treeDataTwo) #Merge data, this seems to work
#Adding new columns to store data
treeDataAll['DIRECT_REPORTS'] <- 0
treeDataAll['INDIRECT_REPORTS'] <- 0
treeDataAll['DIVISION'] <- ""
treeDataAll['BRANCH'] <- ""
treeDataAll['PROCESSED'] <- FALSE
I'm now trying to iterate over every record and calculate the direct reports.
So in pseudocode it should be:
for i in treeDataAll{
i.DIRECT_REPORTS = nrow(where REPORT_TO_POS_NUM = i.pos_num)
}
library(data.table)
setDT(treeDataAll)
funky <- function(x){
nrow(treeDataAll[REPORT_TO_POS_NUM == x])
}
treeDataAll[, DIR_REPORTS := funky(POS_NUM), by = POS_NUM]
treeDataAll[]
# POS_NUM TITLE REPORT_TO_POS_NUM DIR_REPORTS
# 1: 1234 Bob 789 1
# 2: 5698 Jim 1234 0
# 3: 8976 Frank 1653 0
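For the recursive part of the question (all sub-reports, not just direct ones), a minimal base R sketch might look like the following. It is an illustration under stated assumptions: the extra "Ann" row and the helper count_all_reports are made up, and the recursion assumes the reporting lines contain no cycles.

```r
# Sample rows from the question, plus one made-up row (Ann reports to Jim)
org <- data.frame(
  POS_NUM = c(1234, 5698, 8976, 4321),
  TITLE = c("Bob", "Jim", "Frank", "Ann"),
  REPORT_TO_POS_NUM = c(789, 1234, 1653, 5698)
)

# Direct reports: how often each position appears as someone's manager
org$DIRECT_REPORTS <- as.integer(
  table(factor(org$REPORT_TO_POS_NUM, levels = org$POS_NUM))
)

# Hypothetical helper: direct reports plus everything below them (no cycles assumed)
count_all_reports <- function(pos, org) {
  children <- org$POS_NUM[org$REPORT_TO_POS_NUM == pos]
  length(children) +
    sum(vapply(children, count_all_reports, numeric(1), org = org))
}

org$ALL_REPORTS <- vapply(org$POS_NUM, count_all_reports, numeric(1), org = org)
org$INDIRECT_REPORTS <- org$ALL_REPORTS - org$DIRECT_REPORTS
org
```

Each call scans the whole table, so with 78k records it would probably pay to precompute the children of each position once (for example with split()) rather than filtering inside the recursion.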
I have an Excel file that has multiple sheets. Each sheet looks like this, with some excess data at the bottom:
A B C D....
1 time USA USA USA
2 MD CA PX
3 pork peas nuts
4 jan-11 4 2 2
5 feb-11 4 9 3
6 mar-11 8 8 3
.
.
workbook1|workbook2.....
The file is 11 mb, but when I try to use
sheet<-readWorksheetFromFile("excelfile.xlsx", sheet = 1)
I get
Error: OutOfMemoryError (Java): Java heap space
On each worksheet the data takes up a different number of rows and columns, so I want to write something that produces this for each sheet.
I am trying to convert each column into
country state product unit time
USA MD pork 3 jan-11
USA MD pork 3 feb-11
USA MD pork 3 mar-11
...
..
.
Is there any way to do this in R?
If your spreadsheet is full of formulas, you might need to convert those to values to get them to be read in easily. Otherwise, I would suggest using a tool like this one (among others out there) to convert all the sheets in a workbook to CSV files and work from there.
If you've gotten that far, here's something that can be tried for the "reshaping" part of your question. Here, we'll assume that "A" actually represents a CSV file, the contents of which are the six lines shown as sample data in your question:
## Create some sample data
A <- tempfile()
writeLines(sep="\n", con = A,
text = c("time, USA, USA, USA",
", MD, CA, PX",
", pork, peas, nuts",
"jan-11, 4, 2, 2",
"feb-11, 4, 9, 3",
"mar-11, 8, 8, 3"))
The first thing I would do is read in the headers and the data separately. To read the headers separately, use nrows to specify the number of rows that contain the header information. To read the data separately, specify skip to skip the header rows.
B <- read.csv(A, header = FALSE, skip = 3, strip.white = TRUE)
Bnames <- read.csv(A, header = FALSE, nrows = 3, strip.white = TRUE)
Use apply to paste the header rows together to form the names for the resulting data.frame:
names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
B
# time USA_MD_pork USA_CA_peas USA_PX_nuts
# 1 jan-11 4 2 2
# 2 feb-11 4 9 3
# 3 mar-11 8 8 3
Now comes the part of converting the data from a "wide" to a "long" format. There are many ways to do this, some using base R too, but the most direct is to use melt and colsplit from the "reshape2" package:
library(reshape2)
BL <- melt(B, id.vars="time")
cbind(BL[c("time", "value")],
colsplit(BL$variable, "_",
c("country", "state", "product")))
# time value country state product
# 1 jan-11 4 USA MD pork
# 2 feb-11 4 USA MD pork
# 3 mar-11 8 USA MD pork
# 4 jan-11 2 USA CA peas
# 5 feb-11 9 USA CA peas
# 6 mar-11 8 USA CA peas
# 7 jan-11 2 USA PX nuts
# 8 feb-11 3 USA PX nuts
# 9 mar-11 3 USA PX nuts
Unfortunately, XLConnect is unlikely to work in your application. I can confirm that on a system with 8GB RAM, running Win 7 64bit and 64bit R 3.0.2, XLConnect fails with a 22MB .xlsx file, with the same error that you are getting. As @Ista pointed out, and as explained here, after restarting R and before doing anything else:
options(java.parameters = "-Xmx4096m")
library(XLConnect)
wb <- loadWorkbook("myWorkBook.xlsx")
sheet <- readWorksheet(wb,"Data")
avoids the error. However, the import still takes more than an hour(!!).
In contrast, as @Gaffi pointed out, once the sheet "Data" is saved to a csv file (~7MB), it can be imported as follows:
library(data.table)
system.time(sheet <- fread("Data.csv"))
user system elapsed
0.84 0.00 0.86
in less than 1 second. In my test case sheet has 6 columns and ~376,000 rows.
Sorry about this "second answer", but you really had two questions... @Ananda's solution for reshaping your data is extremely elegant. This is just another way to think about it.
If you transpose the input matrix you get a new matrix, where the first column is country, the second column is city, the third column is "type" (for lack of a better term), and the actual data is in the other columns (so, there is one additional column for every "time").
So a different approach is to transpose first and then melt the new matrix. This avoids creating all the concatenated column names and splitting them back later. The problem is that melt.data.frame is exceptionally inefficient with a very large number of columns (which you would have here). So doing it this way would be 10X slower than @Ananda's approach.
A solution is to use melt.array (just call melt(...) with an array rather than a data frame). As shown below, this approach is ~20X faster, with larger datasets (yours was 11MB).
library(reshape) # for melt(...)
library(microbenchmark) # for microbenchmark(...)
# this is just to model your situation with more realistic size
# create a large data frame (250 columns of country, city, type; 1000 rows of time)
df <- rep(c("USA","UK","FR","CHN","GER"),each=50) # time + 250 columns
df <- rbind(df,rep(c(c("NY","SF","CHI","BOS","LA")),each=10))
df <- rbind(df,rep(c("pork","peas","nuts","fruit","other")))
df <- rbind(df,matrix(sample(1:1000,250*1000,replace=T),ncol=250))
df <- cbind(c("time","","",
as.character(as.Date(1:1000,origin="2010-01-01"))),df)
df <- data.frame(df) # big warning here about duplicated row names; not important
# @Ananda's approach:
transform.orig <- function(df){
B <- df[-(1:3),]
Bnames <- df[1:3,]
names(B) <- apply(Bnames, 2, function(x) paste(x[x != ""], collapse = "_"))
BL <- melt(B, id.vars="time")
final <- cbind(BL[c("time", "value")],
colsplit(BL$variable, "_",
c("country", "state", "product")))
return(final)
}
# transpose approach:
transform.new <- function(df) {
zz <- t(df)
times <- t(zz[1,4:ncol(zz)])
colnames(zz) <- c("country","city","type", times)
data <- melt(zz[-1,-(1:3)],varnames=c("id","time"))
final <- cbind(country=rep(zz[-1,1],each=ncol(zz)-3),
city =rep(zz[-1,2],each=ncol(zz)-3),
type =rep(zz[-1,3],each=ncol(zz)-3),
data[,-1])
return(final)
}
# benchmark
microbenchmark(transform.orig(df),transform.new(df), times=5, unit="s")
Unit: seconds
expr min lq median uq max neval
transform.orig(df) 9.2511679 9.6986330 9.889457 10.1518191 10.3354328 5
transform.new(df) 0.4383197 0.4724145 0.474212 0.5815531 0.6886383 5
For reading the data from Excel, try the openxlsx package. It uses C++ instead of Java, and handles larger Excel files better.
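As a rough sketch of what that could look like, the snippet below reads every sheet of a workbook into a named list with getSheetNames() and read.xlsx(). The small two-sheet workbook is made up and written to a temporary file so the example runs on its own; with your data you would point path at your own .xlsx file instead.

```r
library(openxlsx)

# Build a small two-sheet workbook so the example is self-contained
path <- tempfile(fileext = ".xlsx")
write.xlsx(
  list(sheet_a = data.frame(time = c("jan-11", "feb-11"), USA = c(4, 4)),
       sheet_b = data.frame(time = "jan-11", USA = 2)),
  file = path
)

# Read every sheet into a named list of data frames
sheet_names <- getSheetNames(path)
sheets <- lapply(sheet_names, function(s) read.xlsx(path, sheet = s))
names(sheets) <- sheet_names
str(sheets)
```

For sheets with several header rows like yours, passing colNames = FALSE to read.xlsx and handling the header rows yourself may be easier.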
To reshape your data, look at the tidyr package. The gather function could help you out.
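In current tidyr, gather() is superseded by pivot_longer(), which can also split the combined column names in the same step. The sketch below assumes the header rows have already been pasted into names such as USA_MD_pork; the data frame B is typed in by hand here for illustration.

```r
library(tidyr)

# Wide data with country, state and product pasted into the column names
B <- data.frame(
  time = c("jan-11", "feb-11", "mar-11"),
  USA_MD_pork = c(4, 4, 8),
  USA_CA_peas = c(2, 9, 8),
  USA_PX_nuts = c(2, 3, 3)
)

# Reshape to long format and split the names on "_" in one step
long <- pivot_longer(
  B,
  cols = -time,
  names_to = c("country", "state", "product"),
  names_sep = "_",
  values_to = "unit"
)
long
```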