Tie melted table object back to original dataframe? - r

I am trying to count the number of times each word in a row in a dataframe occurs at a given time. Here is my dataframe:
library(stringr)
df <- data.frame("Corpus" = c("this is some text",
"here is some more text text",
"more food for everyone",
"less for no one",
"something text here is some more text",
"everyone should go home",
"more random text",
"random text more more more",
"plenty of random text",
"the final piece of random everyone text"),
"Class" = c("X", "Y", "Y", "Y", "Y",
"Y", "Y", "Z",
"Z", "Z"),
"OpenTime" = c("12/01/2016 10:45:00", "11/07/2016 10:32:00",
"11/15/2015 01:45:00", "08/23/2012 1:23:00",
"12/17/2016 11:45:00", "12/16/2016 9:47:00",
"04/11/2015 04:23:00", "11/27/2016 12:12:00",
"08/25/2015 10:46:00", "09/27/2016 10:46:00"))
I am trying to get this result:
Class OpenTime Word Frequency
X 12/01/2016 10:45:00 this 1
X 12/01/2016 10:45:00 is 1
X 12/01/2016 10:45:00 some 1
X 12/01/2016 10:45:00 text 1
Y 11/07/2016 10:32:00 here 1
Y 11/07/2016 10:32:00 is 1
Y 11/07/2016 10:32:00 some 1
Y 11/07/2016 10:32:00 more 1
Y 11/07/2016 10:32:00 text 2
...
I'd love to do this all with groupby in dplyr, but I haven't yet got that to work. Instead, this is what I've tried:
splits <- strsplit(as.character(df$Corpus), split = " ")
counts <- lapply(splits, table)
counts.melted <- lapply(counts, melt)
This gives me the transposed view I want:
> counts.melted
[[1]]
Var1 value
1 is 1
2 some 1
3 text 1
4 this 1
[[2]]
Var1 value
1 here 1
2 is 1
3 more 1
4 some 1
5 text 2
...
But how can I tie that list of melted tables back to the original data to produce the desired output above? I tried using rep to repeat the Class value once for each word in a row, but have had little success. It would be easy to do all of this in a for loop, but I would much rather use vectorised methods like lapply.
out.df <- data.frame("RRN" = NULL, "OpenTime" = NULL,
"Word" = NULL, "Frequency" = NULL)

For those coming here in the future, I was able to vectorize most of the solution to my problem. Unfortunately, I'm still looking for ways to use lapply instead of the for loop below, but this does exactly what I want:
library(reshape2)  # provides melt()
# split each row in the corpus column on spaces
splits <- strsplit(as.character(df$Corpus), split = " ")
# count the number of times each word in a row appears in that row
counts <- lapply(splits, table)
# melt that table to make things more palatable
counts.melted <- lapply(counts, melt)
# the result data frame to which we'll append our results
out.df <- data.frame("Class" = c(), "OpenTime" = c(),
                     "Word" = c(), "Frequency" = c())
# it would be better to vectorize this, using something like lapply
for (idx in seq_along(counts.melted)) {
  # take the melted table at that index (already a data frame)
  count.df <- counts.melted[[idx]]
  # change the column names
  names(count.df) <- c("Word", "Frequency")
  # repeat the Class and time for that row to fill in those columns
  count.df[, 'Class'] <- rep(as.character(df[idx, "Class"]), nrow(count.df))
  count.df[, 'OpenTime'] <- rep(as.character(df[idx, "OpenTime"]), nrow(count.df))
  # append the results
  out.df <- rbind(out.df, count.df)
}
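The remaining for loop can probably be replaced with Map, which walks over the corpus, class and time columns in parallel. The following is only a sketch in base R (it swaps melt for a plain data.frame call, so reshape2 is not needed); the helper name count_words is made up for illustration, and the example uses just the first two rows of the data above.

```r
# Two rows of the example data, kept as character columns
df <- data.frame(
  Corpus = c("this is some text", "here is some more text text"),
  Class = c("X", "Y"),
  OpenTime = c("12/01/2016 10:45:00", "11/07/2016 10:32:00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: word counts for one row, tagged with that row's
# Class and OpenTime (scalars are recycled by data.frame)
count_words <- function(corpus, class, open_time) {
  counts <- table(strsplit(corpus, " ", fixed = TRUE)[[1]])
  data.frame(
    Class = class,
    OpenTime = open_time,
    Word = names(counts),
    Frequency = as.vector(counts),
    stringsAsFactors = FALSE
  )
}

# Map applies the helper row by row; do.call(rbind, ...) stacks the pieces
pieces <- Map(count_words, df$Corpus, df$Class, df$OpenTime)
out.df <- do.call(rbind, unname(pieces))
rownames(out.df) <- NULL
```

Building all the pieces first and binding them once also avoids growing out.df inside a loop, which tends to get slow on larger inputs.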

Related

How to split a sentence in two halves in R

I have a vector of strings, and I want each string to be cut roughly in half, at the nearest space.
For example, with the following data:
test <- data.frame(init = c("qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk",
"qsdf",
"mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll",
"qsddddddddddddddddddddddddddddddd",
"qsdfmlk mlk mkljlmkjlmkjml lmj mjjmjmjm lkj"), stringsAsFactors = FALSE)
I want to get something like this :
first sec
1 qsdf mqsldkfop mqsdfmlk lksdfp pqpdfm mqsdfmj mlk
2 qsdf
3 mp mlksdfm mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd
5 lmj mjjmjmjm lkj lmj mjjmjmjm lkj
Any solution that does not cut in halves, but instead cuts "so that the first part isn't longer than X characters", would also be great.
First, we split the strings by spaces.
a <- strsplit(test$init, " ")
Then we find the last element of each vector for which the cumulative sum of characters is lower than half the sum of all characters in the vector:
b <- lapply(a, function(x) which.max(cumsum(cumsum(nchar(x)) <= sum(nchar(x))/2)))
Afterwards we combine the two halves, substituting NA if the vector was of length 1 (only one word).
combined <- Map(function(x, y){
  # test the number of words, not y: y can also be 1 for a multi-word string
  if(length(x) == 1){
    return(c(x, NA))
  }else{
    return(c(paste(x[1:y], collapse = " "), paste(x[(y+1):length(x)], collapse = " ")))
  }
}, a, b)
Finally, we rbind the combined strings and change the column names.
newdf <- do.call(rbind.data.frame, combined)
names(newdf) <- c("first", "second")
Result:
> newdf
  first                              second
1 qsdf mqsldkfop mqsdfmlk            lksdfp pqpdfm mqsdfmj mlk
2 qsdf                               <NA>
3 mp mlksdfm                         mkmlklkjjjjjjjjjjjjjjjjjjjjjjklmmjlkjll
4 qsddddddddddddddddddddddddddddddd  <NA>
5 qsdfmlk mlk                        mkljlmkjlmkjml lmj mjjmjmjm lkj
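For the variant mentioned in the question ("so that the first part isn't longer than X characters"), the same cumulative-sum idea can be adapted. This is a rough sketch rather than a tested library function: split_at_width and its width argument are hypothetical names, the input is assumed to be a non-empty string with single spaces between words, and a first word longer than width is kept whole instead of being cut.

```r
# Hypothetical helper: put as many whole words as fit into `width`
# characters in the first part, and the remainder in the second part
split_at_width <- function(s, width) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  # length of the first k words joined by single spaces, for each k
  running <- cumsum(nchar(words)) + seq_along(words) - 1
  # number of words that fit; always keep at least the first word
  k <- max(1, sum(running <= width))
  first <- paste(words[seq_len(k)], collapse = " ")
  second <- if (k < length(words)) {
    paste(words[(k + 1):length(words)], collapse = " ")
  } else {
    NA_character_
  }
  c(first = first, second = second)
}

res_multi <- split_at_width("qsdf mqsldkfop mqsdfmlk lksdfp", 15)
res_single <- split_at_width("qsdf", 2)
```

To apply it to the whole column, something like t(vapply(test$init, split_at_width, character(2), width = 20, USE.NAMES = FALSE)) should give a two-column character matrix.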
You can use the function nbreak from the package that I wrote:
devtools::install_github("igorkf/breaker")
library(tidyverse)
test <- data.frame(init = c("Phrase with four words", "That phrase has five words"), stringsAsFactors = F)
# Count the number of words in each row:
nwords = str_count(test$init, " ") + 1
# Position (in words) at which to break each row; ceiling() puts the
# extra word of an odd count in the first half:
break_here = ceiling(nwords / 2)
test
# init
# 1 Phrase with four words
# 2 That phrase has five words
#the map2_chr is applying a function with two arguments,
#the string is "init" and the n is "break_here":
test %>%
mutate(init = map2_chr(init, break_here, ~breaker::nbreak(string = .x, n = .y, loop = F))) %>%
separate(init, c("first", "second"), sep = "\n")
# first second
# 1 Phrase with four words
# 2 That phrase has five words

How do I replace values in a matrix from an uploaded CSV file in R?

These are the steps I took:
1) Read in CSV file
rawdata <- read.csv('name of my file', stringsAsFactors=FALSE)
2) Cleaned my data by removing certain records based on x-criteria
data <- rawdata[!(rawdata$YOURID==""), all()]
data <- data[(data$thiscolumn=="right"), all()]
data <- data[(data$thatcolumn=="right"), all()]
3) Now I want to replace certain values throughout the whole matrix with a number (replace a string with a number value). I have tried the following commands and nothing works (I've tried gsub and replace):
gsub("Not the right string", 2, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
data <- replace(data, data$thiscolumn == "Not the right string" , 2)
gsub("\\Not the right string", "2", data$thiscolumn, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
I am new to R. I normally code in C++. The only other thing I can think of trying is a for loop. I might only want R to look at certain columns when replacing certain values, but I'd prefer a search through the whole matrix. Either is fine.
These are the guidelines per R Help:
sub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
replace(x, list, values)
Arguments
x vector
list an index vector
values replacement values
Example: I want to replace the text "Extremely Relevant 5" or whatever x-text, with a corresponding number value.
You can substitute the for loop by using logical indexing. First you need to identify the indices of what you want to replace, then assign the new value for these indices.
Here's small example. Let's say we have this vector:
x <- c(1, 2, 99, 4, 2, 99)
# x
# [1] 1 2 99 4 2 99
And we want to find all places where it's 99 and replace it with 0. When you evaluate x == 99, you get a logical vector of TRUE and FALSE values.
x == 99
# [1] FALSE FALSE TRUE FALSE FALSE TRUE
You can use this vector as an index to assign the new value where the condition is met.
x[x == 99] <- 0
# x
# [1] 1 2 0 4 2 0
Similarly, you can apply this approach across a whole data frame or matrix in one shot:
df <- data.frame(col1 = c(2, 99, 3), col2 = c(99, 4, 99))
# df:
# col1 col2
# 1 2 99
# 2 99 4
# 3 3 99
df[df==99] <- 0
# df
# col1 col2
# 1 2 0
# 2 0 4
# 3 3 0
For a data frame with strings, it might be trickier, since a column can be a factor and the value you're trying to assign may not be one of its levels. You can get around that by converting the columns to character and then applying the replacement.
> df <- data.frame(col1 = c(2, "this string", 3), col2 = c("this string", 4, "this string"))
> df
col1 col2
1 2 this string
2 this string 4
3 3 this string
> sapply(df, class)
col1 col2
"factor" "factor"
> df <- sapply(df, as.character)
> df
col1 col2
[1,] "2" "this string"
[2,] "this string" "4"
[3,] "3" "this string"
> df[df == "this string"] <- 0
> df <- as.data.frame(df)
> df
col1 col2
1 2 0
2 0 4
3 3 0
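When several different strings each need their own number (as in the "Extremely Relevant 5" example from the question), a named lookup vector may be more convenient than one assignment per value. The sketch below is illustrative only: the data frame survey, its columns, and every label other than "Extremely Relevant 5" are invented, and it assumes all columns are character columns whose values all appear in the lookup (anything else becomes NA).

```r
# Hypothetical survey data with text answers
survey <- data.frame(
  q1 = c("Extremely Relevant 5", "Not Relevant 1"),
  q2 = c("Not Relevant 1", "Somewhat Relevant 3"),
  stringsAsFactors = FALSE
)

# Named vector: names are the text to find, values are the replacements
lookup <- c(
  "Extremely Relevant 5" = 5,
  "Somewhat Relevant 3" = 3,
  "Not Relevant 1" = 1
)

# Index the lookup by each column's text; survey[] keeps the data frame shape
survey[] <- lapply(survey, function(col) unname(lookup[col]))
```

Because the result of the lookup is numeric, the columns end up as numeric vectors rather than as strings that happen to contain digits.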
Working on this a little more, I found a few solutions to my own question that I thought I'd share.
1) I had to add library(stringr) at the top so that R can understand matching strings.
2) I used a for loop to go down the entries of the specific column of my matrix that I wanted to change to the indicated value, as follows:
#possible solution 5 - This totally works!
for (i in 1:nrow(data)){
if (data$columnofinterest[i] == "String of Interest")
data$columnofinterest[i] <- "Becca is da bomb dot com"
}
#possible solution 6 - This totally works!
for (i in 1:nrow(data)){
if (data$columnofinterest[i] == "Becca is da bomb dot com")
data$columnofinterest[i] <- 7
}
As you can see, replacing specific records between text and a numerical value is possible (text to numerical value and vice versa). As the comments indicate, it took me until attempts 5 and 6 to figure this much out. It still doesn't cover the whole matrix, but at least I can go through one column of interest at a time, which is still a lot faster.
Here's a dplyr/tidyverse solution adapted from changing multiple column values given a condition in dplyr. You can use mutate_all:
library(tidyverse)
data <- tibble(a = c("don't change", "change", "don't change"),
b = c("change", "Change", "don't change"))
data %>%
# funs() is deprecated in recent dplyr versions; a formula lambda does the same job
mutate_all(~ if_else(. == "change", "xxx", .))

Use a vector/index as a row name in a dataframe using rbind

I think I'm missing something super simple, but I seem to be unable to find a solution directly relating to what I need: I've got a data frame that has a letter as the row name and two columns of numerical values. As part of a loop I'm running, I create a new vector (from an index) that has both a letter and a number (e.g. "f2"), which I then need to be the name of a new row, with two numbers added next to it (based on some other section of code, but I'm fine with that). What I get instead is the name of the vector/index as the row name, and I'm not sure if I'm missing a feature of rbind or something else that would make this easy.
Example code:
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
rownames(data.frame) <- row.names
data.frame
index.vector <- "f2"
#what I want the data frame to look like with the new row
data.frame <- rbind(data.frame, "f2" = c(6,11))
data.frame
#what the data frame looks like when I attempt to use a vector as a row name
data.frame <- rbind(data.frame, index.vector = c(6,11))
data.frame
#"why" I can't just type "f" every time
index.vector2 = paste(index.vector, "2", sep="")
data.frame <- rbind(data.frame, index.vector2 = c(6,11))
data.frame
In my loop the "index.vector" is a random sample, hence where I can't just write the letter/number in as a row name, so need to be able to create the row name from a vector or from the index of the sample.
The loop runs and a random number of new rows will be created, so I can't specify what number the row is that needs a new name - unless there's a way to just do it for the newest or bottom row every time.
Any help would be appreciated!
Not elegant, but works:
new_row <- data.frame(setNames(list(6, 11), colnames(data.frame)), row.names = paste(index.vector, "2", sep=""))
data.frame <- rbind(data.frame, new_row)
data.frame
# vector.1 vector.2
# a 1 2
# b 2 3
# c 3 4
# d 4 5
# e 5 6
# f22 6 11
I understood the problem but was not able to resolve it directly, so I'm suggesting an alternative way to achieve the same result.
Alternative solution: collect your row labels as you bind the data in your loop, and then assign the row names to your data frame at the end.
#Data frame and vector creation
row.names <- letters[1:5]
vector.1 <- c(1:5)
vector.2 <- c(2:6)
vector.3 <- letters[6:10]
data.frame <- data.frame(vector.1,vector.2)
#loop starts
index.vector <- "f2"
data.frame <- rbind(data.frame,c(6,11))
row.names<-append(row.names,index.vector)
#loop ends
rownames(data.frame) <- row.names
data.frame
output:
vector.1 vector.2
a 1 2
b 2 3
c 3 4
d 4 5
e 5 6
f2 6 11
Hope this would be helpful.
If you manipulate the data frame with rbind, then the newest elements will always be at the "bottom" of your data frame. Hence you could also set a single row name by
rownames(data.frame)[nrow(data.frame)] = "new_name"
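Put together inside a loop, that last suggestion might look like the sketch below. The names scores and sampled are made up for illustration (the original code calls the data frame data.frame, which shadows the function of the same name), and the values c(6, 11) simply stand in for whatever the rest of the loop computes.

```r
# Starting data, with letters as row names
scores <- data.frame(vector.1 = 1:5, vector.2 = 2:6, row.names = letters[1:5])

# Stand-in for the random sample drawn in the real loop
sampled <- c("f", "g")

for (letter in sampled) {
  # build the row name from the sampled value
  index.vector <- paste(letter, "2", sep = "")
  # append the new row; rbind gives it an automatic row name
  scores <- rbind(scores, c(6, 11))
  # the new row is always the last one, so rename just that row
  rownames(scores)[nrow(scores)] <- index.vector
}
```

The reason rbind(data.frame, index.vector = c(6, 11)) does not work as hoped is that the text to the left of = is taken literally as the argument name; the contents of the variable index.vector are never looked at.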

R: extract and paste keyword matches

I am new to R and have been struggling with this one. I want to create a new column that checks whether any of a set of words ("foo", "x", "y") occurs in the column 'text', and then writes the matching words into the new column.
I have a data frame that looks like this: a->
id text time username
1 "hello x" 10 "me"
2 "foo and y" 5 "you"
3 "nothing" 15 "everyone"
4 "x,y,foo" 0 "know"
The correct output should be:
a2 ->
id text time username keywordtag
1 "hello x" 10 "me" x
2 "foo and y" 5 "you" foo,y
3 "nothing" 15 "everyone" 0
4 "x,y,foo" 0 "know" x,y,foo
I have this:
df1 <- data.frame(text = c("hello x", "foo and y", "nothing", "x,y,foo"))
terms <- c('foo', 'x', 'y')
df1$keywordtag <- apply(sapply(terms, grepl, df1$text), 1, function(x) paste(terms[x], collapse=','))
Which works, but crashes R when my needleList contains 12k words and my text has 155k rows. Is there a way to do this that won't crash R?
This is a variation on what you have done, and what was suggested in the comments. This uses dplyr and stringr. There may be a more efficient way but this may not crash your R session.
library(dplyr)
library(stringr)
terms <- c('foo', 'x', 'y')
term_regex <- paste0('(', paste(terms, collapse = '|'), ')')
### Solution: this uses dplyr::mutate and stringr::str_extract_all
df1 %>%
mutate(keywordtag = sapply(str_extract_all(text, term_regex), function(x) paste(x, collapse=',')))
# text keywordtag
#1 hello x x
#2 foo and y foo,y
#3 nothing
#4 x,y,foo x,y,foo
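With 12k terms, a single alternation regex can itself become very large, so another option may be to tokenise the text once and look the tokens up with %in%. This is a sketch under an assumption the question does not state: keywords are whole words, separated from each other by spaces or punctuation. The function name tag_keywords is made up.

```r
# Hypothetical helper: split each text into tokens and keep the ones
# that appear in `terms`, in order of first appearance
tag_keywords <- function(text, terms) {
  # split on any run of characters that are not letters or digits
  tokens <- strsplit(text, "[^[:alnum:]]+")
  vapply(
    tokens,
    function(tok) paste(unique(tok[tok %in% terms]), collapse = ","),
    character(1)
  )
}

df1 <- data.frame(
  text = c("hello x", "foo and y", "nothing", "x,y,foo"),
  stringsAsFactors = FALSE
)
terms <- c("foo", "x", "y")
df1$keywordtag <- tag_keywords(df1$text, terms)
```

Unlike the regex version, this does not match a term that occurs inside a longer word (for example the "x" in "text"), which may or may not be what is wanted.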

issue with reading and writing a csv file in R language

I have a table in csv format, the data is the following:
1 3 1 2
1415_at 1 8.512147859 8.196725061 8.174426394 8.62388149
1411_at 2 9.119200527 9.190318548 9.149239039 9.211401637
1412_at 3 10.03383593 9.575728316 10.06998673 9.735217522
1413_at 4 5.925999419 5.692092375 5.689299161 7.807354922
When I read it with:
m <- read.csv("table.csv")
and print the values of m, I notice that they change to:
X X.1 X1 X3 X1.1 X4
1 1415_at 1 8.512148 8.196725 8.174426 8.623881
I made some manipulation to keep only those columns that are labelled 1 or 2, so I do that with:
smallerdat <- m[ grep("^X$|^X.1$|^X1$|^X2$|1\\.|2\\." , names(m) ) ]
write.csv(smallerdat,"table2.csv")
it writes the file with those annoying headers and with an extra first column, which I do not need:
X X.1 X1 X1.1 X2
1 1415_at 1 8.512148 8.174426 8.623881
so when I open that data in Excel the headers are still X, X.1 and so on. What I need is for the headers to remain the same as:
1 1 2
1415_at 1 8.196725061 8.174426394 8.62388149
any help?
Please also notice the first column that is added automatically. I do not need it, so how can I get rid of that column?
There are two issues here.
For reading your CSV file, use:
m <- read.csv("table.csv", check.names = FALSE)
Notice that by doing this, though, you can't use the column names as easily. You have to quote them with backticks instead, and will most likely still run into problems because of duplicated column names:
m$1
# Error: unexpected numeric constant in "m$1"
m$`1`
# [1] 8.512148 9.119201 10.033836 5.925999
For writing your "m" object to a CSV file, use:
write.csv(m, "table2.csv", row.names = FALSE)
After reading your file in using the method in step 1, you can subset as follows. If you wanted the first column and any columns named "3" or "4", you can use:
m[names(m) %in% c("", "3", "4")]
# 3 4
# 1 1415_at 1 8.196725 8.623881
# 2 1411_at 2 9.190319 9.211402
# 3 1412_at 3 9.575728 9.735218
# 4 1413_at 4 5.692092 7.807355
Update: Fixing the names before using write.csv
If you don't want to start from step 1 for whatever reason, you can still fix your problem. While you've succeeded in taking a subset with your grep statement, that doesn't change the column names (not sure why you would expect that it should). You have to do this by using gsub or one of the other regex solutions.
Here are the names of the columns with the way you have read in your CSV:
names(m)
# [1] "X" "X.1" "X1" "X3" "X1.1" "X2"
You want to:
Remove all "X"s
Remove all ".some-number"
So, here's a workaround:
# Change the names in your original dataset
names(m) <- gsub("^X|\\.[0-9]$", "", names(m))
# Create a temporary object to match desired names
getme <- names(m) %in% c("", "1", "2")
# Subset your data
smallerdat <- m[getme]
# Reassign names to your subset
names(smallerdat) <- names(m)[getme]
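The read, subset and write steps can be tried end to end on a small temporary file. The sketch below is illustrative only: it uses shortened values, and it assumes the first two header cells are given names (id and n, both invented here) so that the example does not depend on how empty header cells are handled.

```r
# Write a small input file; "id" and "n" are hypothetical header names
path_in <- tempfile(fileext = ".csv")
path_out <- tempfile(fileext = ".csv")
writeLines(c(
  "id,n,1,3,1,2",
  "1415_at,1,8.51,8.19,8.17,8.62",
  "1411_at,2,9.11,9.19,9.14,9.21"
), path_in)

# check.names = FALSE keeps the headers exactly as written in the file
m <- read.csv(path_in, check.names = FALSE)

# keep the two label columns plus every column headed 1 or 2
keep <- names(m) %in% c("id", "n", "1", "2")
smallerdat <- m[keep]
# subsetting makes duplicated names unique, so put the originals back
names(smallerdat) <- names(m)[keep]

# row.names = FALSE drops the extra first column of row numbers
write.csv(smallerdat, path_out, row.names = FALSE, quote = FALSE)
written <- readLines(path_out)
```

Here quote = FALSE is only used to keep the written header free of quotation marks; drop it if any field could contain a comma.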
I am not sure I understand what you are attempting to do, but here is some code that reads a csv file with missing headers for the first two columns, selects only columns with a header of 1 or 2 and then writes that new data file retaining the column names of 1 or 2.
# first read in only the headers and deal with the missing
# headers for columns 1 and 2
b <- readLines('c:/users/Mark W Miller/simple R programs/missing_headers.csv',
n = 1)
b <- unlist(strsplit(b, ","))
b[1] <- 'name1'
b[2] <- 'name2'
b <- gsub(" ","", b, fixed=TRUE)
b
# read in the rest of the data file
my.data <- (
read.table(file = "c:/users/mark w miller/simple R programs/missing_headers.csv",
na.string=NA, header = F, skip=1, sep=','))
colnames(my.data) <- b
# select the columns with names of 1 or 2
my.data <- my.data[names(my.data) %in% c("1", "2")]
# retain the original column names of 1 or 2
names(my.data) <- floor(as.numeric(names(my.data)))
# write the new data file with original column names
write.csv(
my.data, "c:/users/mark w miller/simple R programs/missing_headers_out.csv",
row.names=FALSE, quote=FALSE)
Here is the input data file. Note the commas with missing names for columns 1 and 2:
, , 1, 3, 1, 2
1415_at, 1, 8.512147859, 8.196725061, 8.174426394, 8.62388149
1411_at, 2, 9.119200527, 9.190318548, 9.149239039, 9.211401637
1412_at, 3, 10.03383593, 9.575728316, 10.06998673, 9.735217522
1413_at, 4, 5.925999419, 5.692092375, 5.689299161, 7.807354922
Here is the output data file:
1,1,2
8.512147859,8.174426394,8.62388149
9.119200527,9.149239039,9.211401637
10.03383593,10.06998673,9.735217522
5.925999419,5.689299161,7.807354922

Resources