merge a data frame output of a column - r

I have a data.frame with two variables: ID and Text
I am using the following text analysis command that gives a data.frame output of 48 columns.
analysis <- textstat_readability(mydata$text, measure = c("all"), remove_hyphens = TRUE)
How can I add those 48 columns of results as separate columns in mydata?
Currently I am using the following:
analysis <- cbind(mydata$ID[1:100000], textstat_readability(mydata$text[1:100000], measure = c("all"), remove_hyphens = TRUE))
But it takes forever to finish.

You have 100,000 records with text. Depending on your system and how big each text record is, that might take a while. You could try speeding up the process by using more cores. Most of quanteda's functions run in parallel, so it is worth a shot.
Try to do the following to see if that speeds it up:
library(quanteda)
# use all available cores - 1
quanteda_options(threads = parallel::detectCores() - 1)
analyses <- textstat_readability(mydata$text[1:100000], measure = c("all"), remove_hyphens = TRUE)
analyses <- cbind(ID = mydata$ID[1:100000], analyses) # bind the IDs back on
Testing this with a data.frame filled with 2,000 copies of data_char_sampletext didn't show much difference when doing it in one cbind() call, but that depends on how big your mydata data.frame already is. It might be better to do it in two steps, as above.

To be honest, I'm not sure why your approach takes forever, but the correct way to do it would be the following, I think:
# 0. Load the package and make a random sample dataset (usually this should
#    be provided in the question, just saying):
library(quanteda)
mydata <- data.frame(ID = 1:100,
                     text = stringi::stri_rand_strings(
                       n = 100,
                       length = runif(100, min = 1, max = 100),
                       pattern = "[A-Za-z0-9]"),
                     stringsAsFactors = FALSE)
# 1. Make a quanteda corpus, where the ID is stored alongside the text column:
mydata_corpus <- corpus(mydata, docid_field = "ID", text_field = "text")
# 2. Then run the readability command:
analysis <- textstat_readability(mydata_corpus, measure = c("all"), remove_hyphens = TRUE)
# 3. Now you can either keep this, or merge it with your original set based on
# IDs:
mydata_analysis <- merge(mydata, analysis, by.x = "ID", by.y = "document")
This should work without you having to use cbind() at all.
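If you want to check that the merge lined up, a minimal sanity check (my sketch, not part of the original answer) is:
stopifnot(all(analysis$document %in% mydata$ID)) # every score row maps back to an ID
nrow(mydata_analysis) == nrow(mydata)            # should be TRUE
Note that corpus() stores the IDs as character docnames, but merge() coerces the key columns, so matching against integer IDs still works.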

Related

kwic() function returns fewer rows than it should

I'm currently trying to perform a sentiment analysis on a kwic object, but I'm afraid that the kwic() function does not return all the rows it should. I'm not quite sure what exactly the issue is, which makes it hard to post a reproducible example, so I hope that a detailed explanation of what I'm trying to do will suffice.
I subsetted the original dataset containing speeches I want to analyze to a new data frame that only includes speeches mentioning certain keywords. I used the following code to create this subset:
ostalgie_cluster <- full_data %>%
  filter(grepl('Schwester Agnes|Intershop|Interflug|Trabant|Trabi|Ostalgie',
               speechContent,
               ignore.case = TRUE))
The resulting data frame consists of 201 observations. When I perform kwic() on the same initial dataset using the following code, however, it returns a data frame with only 82 observations. Does anyone know what might cause this? Again, I'm sorry I can't provide a reproducible example, but when I try to create a reprex from scratch it just... works.
# create quanteda corpus object
qtd_speeches_corp <- corpus(full_data,
                            docid_field = "id",
                            text_field = "speechContent")
# tokenize speeches
qtd_tokens <- tokens(qtd_speeches_corp,
                     remove_punct = TRUE,
                     remove_numbers = TRUE,
                     remove_symbols = TRUE,
                     padding = FALSE) %>%
  tokens_remove(stopwords("de"), padding = FALSE) %>%
  tokens_compound(pattern = phrase(c("Schwester Agnes")), concatenator = " ")
ostalgie_words <- c("Schwester Agnes", "Intershop", "Interflug", "Trabant",
                    "Trabi", "Ostalgie")
test_kwic <- kwic(qtd_tokens,
                  pattern = ostalgie_words,
                  window = 5)
It's something of a guess without a reproducible example (your input full_data, namely), but here's my best take: your kwic() call is using the default "glob" pattern matching, and what you want is a regular expression match instead. grepl() finds the keywords anywhere inside speechContent, whereas "glob" matching in kwic() only matches whole tokens, so variant forms (plurals, compounds) slip through.
Fix it this way:
kwic(qtd_tokens, pattern = ostalgie_words, valuetype = "regex",
     window = 5)
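To see the difference in isolation, here's a toy sketch (made-up sentence, not the asker's data): "glob" compares the pattern against whole tokens, while "regex" also matches inside tokens, which is closer to what grepl() was doing on the raw speeches.
library(quanteda)
toks <- tokens("Der Trabant und zwei Trabis fuhren los")
kwic(toks, pattern = "Trab", valuetype = "glob")  # no hits: no whole token is "Trab"
kwic(toks, pattern = "Trab", valuetype = "regex") # hits both "Trabant" and "Trabis"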

R forecastML package keeps renaming outcome columns

I am trying to use the forecastML R package to run some tests, but the moment I run this step, it renames the columns:
data <- read.csv("C:\\Users\\User\\Desktop\\DG ST Forecast\\LassoTemporalForecast.csv", header=TRUE)
date_frequency <- "1 week"
dates <- seq(as.Date("2012-10-05"), as.Date("2020-10-05"), by = date_frequency)
data_train <- data[1:357,]
data_test <- data[358:429,]
outcome_col <- 1 # The column index of our outcome.
horizons <- c(1,2,3,4,5,6,7,8,9,10,11,12) # models that forecast 1 through 12 time steps ahead.
# A lookback across select time steps in the past. Feature lags 1 through 9, for instance, will be
# silently dropped from the 12-step-ahead model.
lookback <- c(1)
# A non-lagged feature that changes through time whose value we either know (e.g., month) or whose
# value we would like to forecast.
dynamic_features <- colnames(data_train)
data_list <- forecastML::create_lagged_df(data_train,
                                          type = "train",
                                          outcome_col = 1,
                                          horizons = horizons,
                                          lookback = lookback,
                                          date = dates[1:nrow(data_train)],
                                          frequency = date_frequency,
                                          dynamic_features = colnames(data_train))
After creating data_list, the console output shows that the columns have been renamed.
Next, when I try to create windows following the name change,
windows <- forecastML::create_windows(lagged_df = data_list, window_length = 36,
                                      window_start = NULL, window_stop = NULL,
                                      include_partial_window = TRUE)
plot(windows, data_list, show_labels = TRUE)
I get this error: Can't subset columns that don't exist. x Column `cases` doesn't exist.
I've checked my input data and the code many times and still can't understand why the name change occurs. If anyone is familiar with this package, please assist. Thank you!
I'm the package author. It's difficult to tell without a reproducible example, but here's what I think is going on: Dynamic features are essentially features with a lag of 0. Dynamic features also retain their original names, as opposed to lagged features which have "_lag_n" appended to the feature name. So by setting dynamic_features to all column names you are getting duplicate columns specifically for the outcome column. My guess is that "cases" is the outcome here. Fix this by removing dynamic_features = colnames(data_train) and setting it to only those features that you really want to have a lag of 0.
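Concretely, the fix might look like the sketch below; "month" is a hypothetical stand-in for whichever columns genuinely change through time and should stay unlagged:
data_list <- forecastML::create_lagged_df(
  data_train,
  type = "train",
  outcome_col = 1,
  horizons = horizons,
  lookback = lookback,
  date = dates[1:nrow(data_train)],
  frequency = date_frequency,
  dynamic_features = c("month") # hypothetical; NOT colnames(data_train)
)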

Reading ZIP file of machine-written data won't "plot" in RStudio

Summary: Despite a complicated lead-up, the solution was very simple: in order to plot a row of a dataframe as a line instead of a lattice, I needed to transpose the data, inverting it from x obs. of y variables to y obs. of x variables.
I am using RStudio on a Windows 10 computer.
I am using scientific equipment to write measurements to a CSV file. Then I zip several files and read them into R using read.csv. However, the data frame behaves strangely. Commands "length" and "dim" disagree, and the "plot" function throws errors. Because I can create simulated data that doesn't throw the errors, I think the problem is either in how the machine wrote the data or in my loading and processing of the data.
Two ZIP files are located in my stackoverflow repository (with "Monterey Jack" in the name):
https://github.com/baprisbrey/stackoverflow
Here is my code for reading and processing them:
# Unzip the folders
unZIP <- function(folder){
  orig.directory <- getwd()
  setwd(folder)
  zipped.folders <- list.files(pattern = ".*zip")
  for (i in zipped.folders){
    unzip(i)
  }
  setwd(orig.directory)
}
folder <- "C:/Users/user/Documents/StackOverflow"
unZIP(folder)
# Load the data into a list of lists
library(stringr) # for str_remove()
pullData <- function(folder){
  orig.directory <- getwd()
  setwd(folder)
  unzipped.folders <- list.dirs(folder)[-1] # removing the folder itself, listed first
  oData <- vector(mode = "list", length = length(unzipped.folders))
  names(oData) <- str_remove(unzipped.folders, paste(folder, "/", sep = ""))
  for (i in unzipped.folders) {
    filenames <- list.files(i, pattern = "*.csv")
    setwd(i)
    files <- lapply(filenames, read.csv, skip = 5, header = TRUE,
                    fileEncoding = "UTF-16LE") # note unusual encoding
    oData[[str_remove(i, paste(folder, "/", sep = ""))]] <- files
  }
  setwd(orig.directory)
  return(oData)
}
theData <- pullData(folder) # load the data into a list of lists
# Process the data into frames
bigFrame <- function(bigList) {
  # where bigList is theData, the result of pullData
  # initialize the holding list of frames per set
  preList <- vector(mode = "list", length = length(bigList))
  names(preList) <- names(bigList)
  # process the data
  for (i in 1:length(bigList)){
    step1 <- lapply(bigList[[i]], t) # transpose each data set
    step2 <- do.call(rbind, step1)   # roll it up into its own matrix
    # original error that wasn't reproduced: it showed length(step2) = 24048
    # when i = 1 and dim(step2) = 48 501. Any comments on why?
    firstRow <- step2[1, ] # hold onto the first row to become the names
    step3 <- as.data.frame(step2) # turn it into a frame
    step4 <- step3[grepl("µA", rownames(step3)), ] # get rid of all those excess name rows
    rownames(step4) <- 1:(nrow(step4)) # change the row names to row IDs
    colnames(step4) <- firstRow # change the column names to the first-row steps
    step4$ID <- rep(names(bigList[i]), nrow(step4)) # add an ID column
    step4$Class[grepl("pos", tolower(step4$ID))] <- "Yes" # add "Yes" class
    step4$Class[grepl("neg", tolower(step4$ID))] <- "No"  # add "No" class
    preList[[i]] <- step4
  }
  # do.call(rbind, preList) failed due to different numbers of measurements
  # (rows that become columns) across the data sets, so return the list instead:
  return(preList) # works!
}
frameList <- bigFrame(theData)
monterey <- rbind(frameList[[1]], frameList[[2]])
# Odd behaviors
dim(monterey)    # 48 503
length(monterey) # 503 -- this is not reproducing my original error of length = 24048
rowOne <- monterey[1, 1:(ncol(monterey) - 2)]
plot(rowOne) # Error in plot.new() : figure margins too large
# describe the data
quantile(rowOne, seq(0, 1, length.out = 11))
quantile(rowOne, seq(0, 1, length.out = 11)) %>% plot # produces undesired lattice plot
# simulate the data
doppelganger <- sample(1:20461, 501, replace = TRUE)
names(doppelganger) <- names(rowOne)
# describe the data
plot(doppelganger) # successful scatterplot (with my non-random data, I want a line
                   # where the numbers in colnames are along the x-axis)
quantile(doppelganger, seq(0, 1, length.out = 11)) # the random distribution is mildly different
quantile(doppelganger, seq(0, 1, length.out = 11)) %>% plot # a simple line of dots, as desired
# investigating structure
str(rowOne) # a data.frame of 1 observation of 501 variables -- a correct interpretation
str(as.data.frame(doppelganger)) # 501 observations of 1 variable -- not a correct
                                 # interpretation, but it creates the plot that I want
How do I convert rowOne so that it plots like doppelganger?
It looks like one of my errors is not reproducing, where calls to "dim" and "length" apparently disagree.
However, I'm confused as to why the "plot" function is producing a lattice plot on my processed data and a line of dots on my simulated data.
What I would like is to plot each row of data as a line. (Next, and out of the scope of this question, I would like to classify the data with AdaBoost. My concern is that if "plot" behaves strangely then the classifier won't work.)
Any tips or suggestions or explanations or advice would be greatly appreciated.
Edit: Investigating the structure of the two examples with str() explains the difference between the plots. I guess my modified question is: how do I switch between the two structures to enable plotting a line (like doppelganger) instead of a lattice (like rowOne)?
I am answering my own question.
I am leaving out the part about the discrepancy between "length" and "dim", since I can't provide a reproducible example. However, I'm happy to leave it up for comment.
The answer is that in order to produce my plot, I simply have to transpose the row as follows:
rowOne %>% t() %>% as.data.frame() %>% plot
This inverts the structure from one observation of 501 variables to 501 obs of one variable as follows:
rowOne %>% t() %>% as.data.frame() %>% str()
#'data.frame': 501 obs. of 1 variable:
# $ 1: num 8712 8712 8712 8712 8712 ...
Because of the unusual encoding I used, and the strange "length" result, I failed to see a simple solution to my "plot" problem.
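For anyone wanting to extend this to every row at once, here is a sketch using matplot() (my extrapolation from the transpose trick, assuming, as str(rowOne) indicated, that the first 501 columns are numeric measurements and the last two are ID and Class):
measurements <- t(as.matrix(monterey[, 1:(ncol(monterey) - 2)])) # rows become columns
matplot(measurements, type = "l", lty = 1,
        xlab = "step", ylab = "measurement") # one line per original row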

Using value-labels in R with sjlabelled package

Recently I have switched from STATA to R.
In STATA, you have something called a value label. Using the command encode, for example, allows you to turn a string variable into a numeric one, with a string label attached to each number. Since string variables contain names (which repeat themselves most of the time), using value labels allows you to save a lot of space when dealing with large datasets.
Unfortunately, I did not manage to find a similar command in R. The only package I have found that can attach labels to my values vector is sjlabelled. It does the attachment, but when I try to merge the labelled numeric vector into another dataframe, the labels seem to "fall off".
Example: Start with a string variable.
paragraph <- "Melanija Knavs was born in Novo Mesto, and grew up in Sevnica, in the Yugoslav republic of Slovenia. She worked as a fashion model through agencies in Milan and Paris, later moving to New York City in 1996. Her modeling career was associated with Irene Marie Models and Trump Model Management"
install.packages("sjlabelled")
library(sjlabelled)
sentences <- strsplit(paragraph, " ")
sentences <- unlist(sentences, use.names = FALSE)
# Now we have a vector of string values.
sentrnces_df <- as.data.frame(sentences)
sentences <- unique(sentrnces_df$sentences)
group_sentences <- c(1:length(sentences))
sentences <- as.data.frame(sentences)
group_sentences <- as.data.frame(group_sentences)
z <- cbind(sentences,group_sentences)
z$group_sentences <- set_labels(z$group_sentences, labels = (z$sentences))
sentrnces_df <- merge(sentrnces_df, z, by = c('sentences'))
get_labels(z$group_sentences) # the labels I was attaching using set_labels()
get_labels(sentrnces_df$group_sentences) # the output is just NULL
Thanks!
P.S. Sorry about the inelegant code; as I said before, I'm pretty new to R.
source: https://simplystatistics.org/2015/07/24/stringsasfactors-an-unauthorized-biography/
...
Around June of 2007, R introduced hashing of CHARSXP elements in the
underlying C code thanks to Seth Falcon. What this meant was that
effectively, character strings were hashed to an integer
representation and stored in a global table in R. Anytime a given
string was needed in R, it could be referenced by its underlying
integer. This effectively put in place, globally, the factor encoding
behavior of strings from before. Once this was implemented, there was
little to be gained from an efficiency standpoint by encoding
character variables as factor. Of course, you still needed to use
‘factors’ for the modeling functions.
...
I adjusted your initial test data a little bit. I was confused by so many strings and am unsure whether they are necessary for this issue; let me know if I missed a point. Here is my adjustment and the answer:
#####################################
# initial problem rephrased
#####################################
# create test data
library(sjlabelled) # for set_labels() / get_labels()
id <- 1:20
variable1 <- sample(30:35, 20, replace = TRUE)
variable2 <- sample(36:40, 20, replace = TRUE)
df1 <- data.frame(id, variable1)
df2 <- data.frame(id, variable2)
# set arbitrary labels
df1$variable1 <- set_labels(df1$variable1, labels = c("few" = 1, "lots" = 5))
# show labels in this frame
get_labels(df1)
# include associated values
get_labels(df1, values = "as.prefix")
# merge df1 and df2
df_merge <- merge(df1, df2, by = c('id'))
# labels lost after merge
get_labels(df_merge, values = "as.prefix")
#####################################
# solution with dplyr
#####################################
library(dplyr)
df_merge2 <- left_join(x = df1, y = df2, by = "id")
get_labels(df_merge2, values = "as.prefix")
Solution attributed to:
Merging and keeping variable labels in R
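If you prefer to stay with base merge(), another option (a sketch using only the functions already shown here) is to re-attach the labels after merging, since merge() rebuilds the data frame and drops the label attributes:
df_merge <- merge(df1, df2, by = "id")
df_merge$variable1 <- set_labels(df_merge$variable1,
                                 labels = c("few" = 1, "lots" = 5)) # re-attach
get_labels(df_merge, values = "as.prefix")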

write a for loop to automatically create subsets of datasets in r

Please help me, as I am new to both R and programming.
I am trying to write a loop that reads the data in 1000-row chunks and creates a dataset for each chunk in R.
Here is my attempt:
for(i in 0:nl){
  df[i] = fread('RM.csv', skip = 1000*i, nrows = 1000,
                col.names = colnames(read.csv('RM.csv', nrow = 1, header = T)))
}
where nl is an integer equal to the number of rows in 'RM.csv'.
What I am trying to do is create a function that skips 1000 rows, reads the next 1000 rows, and terminates once it reaches nl, the length of the original data.
It is not mandatory to use only this approach.
You can try reading in the entire file into a single data frame, and then subsetting off the rows you don't want:
df <- read.csv('RM.csv', header = TRUE)
y <- seq_len(nrow(df))                         # all row indices in the file
seq.keep <- y[floor((y - 1) / 1000) %% 2 == 0] # keep alternating 1000-row blocks
df.keep <- df[seq.keep, ]
You can inspect that the rows kept by this sequence are:
1-1000
2001-3000
4001-5000
etc.
Because the sequence is built from nrow(df), it automatically matches the actual size of the data frame.
If you need to continue with your current approach, then try reading in only every other 1000-row chunk, e.g.
library(data.table) # for fread()
col_names <- colnames(read.csv('RM.csv', nrow = 1, header = TRUE))
sq <- seq(from = 0, to = floor(nl / 1000), by = 2) # chunk indices 0, 2, 4, ...
for (i in sq) {
  # the +1 skips the header line of the CSV
  df_i <- fread('RM.csv', skip = 1000 * i + 1, nrows = 1000,
                header = FALSE, col.names = col_names)
  # process this chunk and move on
}
