strpslit a character array and convert to dataframe simultaneously - r

I have what feels like a difficult data manipulation problem, and am hoping to get some guidance. Here is a test version of what my current array looks like, as well as what dataframe I hope to obtain:
dput(test)
c("<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>", "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>")
test
[1] "<play quarter=\"1\" oncourt-id=\"\" time-minutes=\"12\" time-seconds=\"0\" id=\"1\"/>"
[2] "<play quarter=\"2\" oncourt-id=\"\" time-minutes=\"10\" id=\"1\"/>"
desired_df
quarter oncourt-id time-minutes time-seconds id
1 1 NA 12 0 1
2 3 NA 10 NA 1
There are a few problems I am dealing with:
the character array "test" has backslashes where there should be nothing, but i was having difficulty using gsub in this format gsub("\", "", test).
not every element in test has the same number of entries, note in the example that the 2nd element doesn't have time-seconds, and so for the dataframe I would prefer it to return NA.
I have tried using strsplit(test, " ") to first split on spaces, which only exist between different column entires, but then I am returned with a list of lists that is just as difficult to deal with.

You've got xml there. You could parse it, then run rbindlist on the result. This will probably be a lot less hassle than trying to split the name-value pairs as strings.
dflist <- lapply(test, function(x) {
df <- as.data.frame.list(XML::xmlToList(x))
is.na(df) <- df == ""
df
})
data.table::rbindlist(dflist, fill = TRUE)
# quarter oncourt.id time.minutes time.seconds id
# 1: 1 NA 12 0 1
# 2: 2 NA 10 NA 1
Note: You will need the XML and data.table packages for this solution.

Related

Replacing empty cells in a column with values from another column in R

I am trying to pull the cell values from the StudyID column to the empty cells SigmaID column, but I am running into an odd issue with the output.
This is how my data looks before running commands.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44
ST04238 0 50
LM04829 1 24 LM23921
ST91124 0 89
ST29001 0 55
I tried accomplishing this by writing the syntax in three ways, because I wasn't sure if there is a problem with the way the logic was set up. All three produce the same output.
df$SigmaID <- ifelse(test = df$SigmaID != "", yes = df$SigmaID, no = df$StudyID)
df$SigmaID <- ifelse(df$SigmaID == "", df$StudyID, df3$SigmaID)
df %>% mutate(SigmaID = ifelse(Gender == 0, df$StudyID, df$SigmaID)
Output: instead of pulling the values from from the StudyID column, it is populating one to four digit numbers.
StudyID Gender Region SigmaID
LM24008 1 20 LM24008
LM82993 1 16 LM28888
ST04283 0 44 5
ST04238 0 50 4908
LM04829 1 24 LM23921
ST91124 0 89 209
ST29001 0 55 4092
I have tried recoding the empty spaces to NA and then calling on NA in the logic, but this produced the same output as seen above. I'm wondering if it could have anything to do with variable type or variable attributes and something's off about how it's reading the characters in StudyID. Would appreciate feedback on this issue!
Here is how to do it:
df$SigmaID[df$SigmaID == ""] = df$StudyID[df$SigmaID == ""]
df[df$SigmaID == ""] selects only the rows where SigmaID==""
I also recommend using data.table instead of data.frame. It is faster and has some useful syntax features:
library(data.table)
setDT(df) # setDT converts a data.frame to a data.table
df[SigmaID=="",SigmaId:=StudyID]
Following up on this! As it turns out, default R converts string types into factors. There are a few ways of addressing the issue above.
i <- sapply[df, is.factor]
df[i] <- lapply(df[i], as.character)
Another method:
df <- read.csv("/insert file pathway here", stringAsFactors = FALSE)
This is what I found to be helpful! I'm sure there are additional methods of troubleshooting this as well.

Parsing a string efficiently

So I've got a column in my data frame that is essentially one long characteristic string that is used to encode about variables for each record. It might look something like this:
string<-c('001034002025003996','001934002199004888')
But much longer.
The strings are structured so each 6 characters are paired together. So you can look at the string above like this:
001034 002025 003996
001934 002199 004888
The first three characters of these is a code corresponding to a certain variable and the next three correspond to the value of that variable. So the above can be broken down into three columns that look like this:
var001 var002 var003 var004
1 034 025 996 NA
2 934 199 NA 888
I need a way to parse this string and return a data frame with the expanded columns.
I wrote a nested loop that looks like this:
for(i in 1:length(string)){
text <- string[i]
for(j in seq(1,505,6)){
var <- substr(text,j, j+2)
var.value <- substr(text, j+3, j+5)
index <- (as.numeric(var))
df[i, index] <- var.value
}
}
where df is an empty data frame created to receive the data. This works, but is slow on larger amounts of data. Is there a better way to do this?
1) This one-liner produces a character matrix (which can easily be converted to a data.frame if need be). No packages are used.
read.dcf(textConnection(gsub("(...)(...)", "\\1: \\2\n", string)))
giving:
001 002 003 004
[1,] "034" "025" "996" NA
[2,] "934" "199" NA "888"
2) This alternative produces the same matrix. The read.table produces a long form data.frame and then tapply reshapes it to a wide matrix.
long <- read.table(text = gsub("(...)(...)", "\\1 \\2\n", string),
colClasses = "character", col.names = c("id", "var"))
tapply(long$var, list(gl(length(string), nchar(string[1])/6), long$id), c)

Appending to an R List one by one

Let's say I have data like:
> data[295:300,]
Date sulfate nitrate ID
295 2003-10-22 NA NA 1
296 2003-10-23 NA NA 1
297 2003-10-24 3.47 0.363 1
298 2003-10-25 NA NA 1
299 2003-10-26 NA NA 1
300 2003-10-27 NA NA 1
Now I would like to add all the nitrate values into a new list/vector. I'm using the following code:
i <- 1
my_list <- c()
for(val in data)
{
my_list[i] <- val
i <- i + 1
}
But this is what happens:
Warning message:
In x[i] <- val :
number of items to replace is not a multiple of replacement length
> i
[1] 2
> x
[1] NA
Where am I going wrong? The data is part of a Coursera R Programming coursework. I can assure you that this is not an assignment/quiz. I have been trying to understand what is the best way append elements into a list with a loop? I have not proceeded to the lapply or sapply part of the coursework, so thinking about workarounds.
Thanks in advance.
If it's a duplicate question, please direct me to it.
As we mention in the comments, you are not looping over the rows of your data frame, but the columns (also sometimes variables). Hence, loop over data$nitrate.
i <- 1
my_list <- c()
for(val in data$nitrate)
{
my_list[i] <- val
i <- i + 1
}
Now, instead of looping over your values, a better way is to use that you want the new vector and the old data to have the same index, so loop over the index i. How do you tell R how many indexes there are? Here you have several choices again: 1:nrow(data), 1:length(data$nitrate) and several other ways. Below I have given you a few examples of how to extract from the data frame.
my_vector <- c()
for(i in 1:nrow(data)){
my_vector[i] <- data$nitrate[i] ## Version 1 of extracting from data.frame
my_vector[i] <- data[i,"nitrate"] ## Version 2: [row,column name]
my_vector[i] <- data[i,3] ## Version 3: [row,column number]
}
My suggestion: Rather than calling the collection a list, call it a vector, since that is what it is. Vectors and lists behave a little differently in R.
Of course, in reality you don't want to get the data out one by one. A much more efficient way of getting your data out is
my_vector2 <- data$nitrate

Lists of term-frequency pairs into a matrix in R

I have a large data set in the following format, where on each line there is a document, encoded as word:freqency-in-the-document, separated by space; lines can be of variable length:
aword:3 bword:2 cword:15 dword:2
bword:4 cword:20 fword:1
etc...
E.g., in the first document, "aword" occurs 3 times. What I ultimately want to do is to create a little search engine, where the documents (in the same format) matching a query are ranked; I though about using TfIdf and the tm package (based on this tutorial, which requires the data to be in the format of a TermDocumentMatrix: http://anythingbutrbitrary.blogspot.be/2013/03/build-search-engine-in-20-minutes-or.html). Otherwise, I would just use tm's TermDocumentMatrix function on a corpus of text, but the catch here is that I already have these data indexed in this format (and I'd rather like to use these data, unless the format is truly something alien and cannot be converted).
What I've tried so far is to import the lines and split them:
docs <- scan("data.txt", what="", sep="\n")
doclist <- strsplit(docs, "[[:space:]]+")
I figured I would put something like this in a loop:
doclist2 <- strsplit(doclist, ":", fixed=TRUE)
and somehow get the paired values into an array, and then run a loop that populates a matrix (pre-filled with zeroes: matrix(0,x,y)) by fetching the appripriate values from the word:freq pairs (would that in itself be a good idea to construct a matrix?). But this way of converting does not seem like a good way to do it, the lists keep getting more complicated, and I wouldn't still know how to get to the point where I can populate the matrix.
What I (think I) would need in the end is a matrix like this:
doc1 doc2 doc3 doc4 ...
aword 3 0 0 0
bword 2 4 0 0
cword: 15 20 0 0
dword 2 0 0 0
fword: 0 1 0 0
...
which I could then convert into a TermDocumentMatrix and get started with the tutorial. I have a feeling I am missing something very obvious here, something I probably cannot find because I don't know what these things are called (I've been googling for a day, on the theme of "term document vector/array/pairs", "two-dimensional array", "list into matrix" etc).
What would be a good way to get such a list of documents into a matrix of term-document frequencies? Alternatively, if the solution would be too obvious or doable with built-in functions: what is the actual term for the format that I described above, where there are those term:frequency pairs on a line, and each line is a document?
Here's an approach that gets you the output you say you might want:
## Your sample data
x <- c("aword:3 bword:2 cword:15 dword:2", "bword:4 cword:20 fword:1")
## Split on a spaces and colons
B <- strsplit(x, "\\s+|:")
## Add names to your list to represent the source document
B <- setNames(B, paste0("document", seq_along(B)))
## Put everything together into a long matrix
out <- do.call(rbind, lapply(seq_along(B), function(x)
cbind(document = names(B)[x], matrix(B[[x]], ncol = 2, byrow = TRUE,
dimnames = list(NULL, c("word", "count"))))))
## Convert to a data.frame
out <- data.frame(out)
out
# document word count
# 1 document1 aword 3
# 2 document1 bword 2
# 3 document1 cword 15
# 4 document1 dword 2
# 5 document2 bword 4
# 6 document2 cword 20
# 7 document2 fword 1
## Make sure the counts column is a number
out$count <- as.numeric(as.character(out$count))
## Use xtabs to get the output you want
xtabs(count ~ word + document, out)
# document
# word document1 document2
# aword 3 0
# bword 2 4
# cword 15 20
# dword 2 0
# fword 0 1
Note: Answer edited to use matrices in the creation of "out" to minimize the number of calls to read.table which would be a major bottleneck with bigger data.

Change multiple dataframes in a loop

I have, for example, this three datasets (in my case, they are many more and with a lot of variables):
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame2 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
on each data frame I want to add a variable resulting from a transformation of an existing variable on that data frame. I would to do this by a loop. For example:
datasets <- c("data_frame1","data_frame2","data_frame3")
vars <- c("a","b","c")
for (i in datasets){
for (j in vars){
# here I need a code that create a new variable with transformed values
# I thought this would work, but it didn't...
get(i)$new_var <- log(get(i)[,j])
}
}
Do you have some valid suggestions about that?
Moreover, it would be great for me if it were possible also to assign the new column names (in this case new_var) by a character string, so I could create the new variables by another for loop nested in the other two.
Hope I've not been too tangled in explain my problem.
Thanks in advance.
You can put your dataframes in a list and use lapply to process them one by one. So no need to use a loop in this case.
For example you can do this :
data_frame1 <- data.frame(a=c(1,5,3,3,2), b=c(3,6,1,5,5), c=c(4,4,1,9,2))
data_frame2 <- data.frame(a=c(6,0,9,1,2), b=c(2,7,2,2,1), c=c(8,4,1,9,2))
data_frame3 <- data.frame(a=c(0,0,1,5,1), b=c(4,1,9,2,3), c=c(2,9,7,1,1))
ll <- list(data_frame1,data_frame2,data_frame3)
lapply(ll,function(df){
df$log_a <- log(df$a) ## new column with the log a
df$tans_col <- df$a+df$b+df$c ## new column with sums of some columns or any other
## transformation
### .....
df
})
the dataframe1 becomes :
[[1]]
a b c log_a tans_col
1 1 3 4 0.0000000 8
2 5 6 4 1.6094379 15
3 3 1 1 1.0986123 5
4 3 5 9 1.0986123 17
5 2 5 2 0.6931472 9
I had the same need and wanted to change also the columns in my actual list of dataframes.
I found a great method here (the purrr::map2 method in the question works for dataframes with different columns), followed by
list2env(list_of_dataframes ,.GlobalEnv)

Resources