I am trying to export data points from MongoDB. Unfortunately, I was unable to connect it directly to RStudio, so I copied the query output into a text file and attempted to read it into R as text.
"cityid", "count"
"102","2"
"55","31"
"119","7"
"206","1"
"18","2"
"15","1"
"32","3"
"14","1"
"54","2"
"23","85"
"158","3"
"266","1"
"9","1"
"34","1"
"159","1"
"31","1"
"22","2"
"209","2"
"121","4"
"73","12"
"350","2"
"311","2"
"377","2"
"230","7"
"290","1"
"49","2"
"379","2"
"75","1"
"59","6"
"165","3"
"19","8"
"13","40"
"126","13"
"243","12"
"325","1"
"17","1"
"null","235"
"144","2"
"334","1"
"40","12"
"7","34"
"181","40"
"349","4"
So basically the format is as shown above, and I would like to convert this into a data frame that I can use as a reference for calculations with other datasets.
This is what I tried in order to build the data frame:
L <- readLines(file.choose())                 # read the raw lines
L.df <- as.data.frame(L)                      # one column of raw lines
list <- strsplit(as.character(L.df$L), ",")   # split each line on commas
library("plyr")
df <- ldply(list)                             # bind the pieces into a data frame
colnames(df) <- c("city_id", "count")
str(df)
df$city_id <- suppressWarnings(as.numeric(as.character(df$city_id)))
On the last line I tried to convert the character values to numeric, but this failed and coerced them all to NA.
Does anyone have a better suggestion for getting this into a numeric table?
Or is there actually a better way to bring the MongoDB data into R without copying and pasting it as a text file? I managed to connect to MongoDB using RMongo, but the syntax was far too complicated for me to understand. The query I used is:
db.getCollection('logging_app_location_view_logs').aggregate([
{"$group": {"_id": "$city_id", "total": {"$sum":1}}}
]).forEach(function(l){
print('"' + l._id + '","' + l.total + '"');
});
Thanks in advance for your help!
You don't need to specify the column names again when you have already passed header = TRUE to read.table; the colClasses argument takes care of each column's class. Note also why your own attempt produced NA: readLines keeps the surrounding quote characters in each field, and as.numeric() on a string that still contains quotes returns NA. read.table strips the quotes for you, and na.strings = 'null' turns the literal string null into a proper NA:
df <- read.table(file.choose(), header = TRUE, sep = ",", colClasses = c('character', 'character'), na.strings = 'null')
# convert character to numeric format
char_cols <- which(sapply(df, class) == 'character') # identify character columns
df[char_cols] <- lapply(df[char_cols], as.numeric) # convert character to numeric column
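As for the second part of your question: the mongolite package is usually much simpler than RMongo and can run your aggregation directly, returning a data frame. A minimal sketch, assuming a local MongoDB instance (adjust db and url to your setup):
library(mongolite)
m <- mongo(collection = "logging_app_location_view_logs",
           db = "mydb", url = "mongodb://localhost")
# same $group pipeline as your shell query, passed as JSON; returns a data frame
df <- m$aggregate('[{"$group": {"_id": "$city_id", "total": {"$sum": 1}}}]')
names(df) <- c("city_id", "count")
df$city_id <- suppressWarnings(as.numeric(df$city_id)) # the "null" id becomes NA
This skips the copy-and-paste step entirely.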
I have a CSV file with 141 rows and several columns. I want my data ordered in ascending order by the first two columns, i.e. 'label' and 'index'. Following is my code:
final_data <- read.csv("./features.csv",
header = FALSE,
col.names = c('label','index', 'nr_pix', 'rows_with_1', 'cols_with_1',
'rows_with_3p', 'cols_with_3p', 'aspect_ratio',
'neigh_1', 'no_neigh_above', 'no_neigh_below',
'no_neigh_left', 'no_neigh_right', 'no_neigh_horiz',
'no_neigh_vert', 'connected_areas', 'eyes', 'custom'))
sorted_data_by_label <- final_data[order(label),]
sorted_data_by_index <- sorted_data_by_label[order(index),]
write.table(sorted_data_by_index, file = "./features.csv",
append = FALSE, sep = ',',
row.names = FALSE)
I read from the CSV and use write.table because my requirement is to overwrite the CSV, including the column names.
Now, even with the comma after order(label) and order(index), the sorted data should still keep all the other rows and columns, right?
After running this code, I only get the first row out of 141. Is there a way to fix this problem?
As #akrun mentioned briefly, what you need to do is change
sorted_data_by_label <- final_data[order(label),]
to
sorted_data_by_label <- final_data[order(final_data$label),]
and to change
sorted_data_by_index <- sorted_data_by_label[order(index),]
to
sorted_data_by_index <- sorted_data_by_label[order(sorted_data_by_label$index),]
This is because when you write label on its own, R looks for a label object in the global environment, not inside the final_data data frame.
If you intend to use the index column of final_data, you likewise need the explicit final_data$index.
Other options
You can use with:
sorted_data_by_label <- with(final_data, final_data[order(label),])
sorted_data_by_index <- with(sorted_data_by_label, sorted_data_by_label[order(index),])
In dplyr you can use
sorted_data_by_label <- final_data %>% arrange(label)
sorted_data_by_index <- sorted_data_by_label %>% arrange(index)
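Incidentally, if the goal is one ordering by both columns at once (label first, index breaking ties), you can do it in a single step; these two are equivalent:
sorted_data <- final_data[order(final_data$label, final_data$index), ]
sorted_data <- final_data %>% arrange(label, index) # dplyr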
I want to create a dataframe with 3 columns.
#First column
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
"ABC_E1", "ABC_E2", "ABC_E3",
"ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
The names in column 1 refer to a bunch of named results of the cor.test function. The second column should consist of the correlation coefficients I get by writing ABC_D1$estimate, ABC_D2$estimate, and so on.
My problem is that I don't want to append $estimate manually to every single name in the first column. I tried this:
df1$C2 = paste0(df1$C1, '$estimate')
But this doesn't work; it only gives me this back:
"ABC_D1$estimate", "ABC_D2$estimate", "ABC_D3$estimate",
"ABC_E1$estimate", "ABC_E2$estimate", "ABC_E3$estimate",
"ABC_F1$estimate", "ABC_F2$estimate", "ABC_F3$estimate")
class(df1$C2)
[1] "character
How can I get the numeric result of ABC_D1$estimate into my data frame? How can I convert these character strings into the named numeric values? The third column should consist of the results of $p.value.
As pointed out by #DSGym, there are several problems here, including that it is not very convenient to work with a vector of object names; a list of the objects themselves would be better.
Anyway, I think you can get where you want using:
estimates <- lapply(name_list, function(dat) {
  dat_l <- get(dat) # look the object up by its name
  dat_l[["estimate"]]
})
cbind(name_list, estimates)
This is not really advisable but given those premises...
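To see it work end to end, here is a quick self-contained check with two made-up cor.test objects standing in for your real ones:
ABC_D1 <- cor.test(mtcars$mpg, mtcars$wt)
ABC_D2 <- cor.test(mtcars$mpg, mtcars$hp)
name_list <- c("ABC_D1", "ABC_D2")
estimates <- lapply(name_list, function(dat) get(dat)[["estimate"]])
cbind(name_list, estimates)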
OK, I think I now know what you need.
eval(parse(text = paste0("ABC_D1", '$estimate')))
You paste the two strings together and use the functions parse and eval to get your results.
This is how to do it for your whole data.frame:
library(purrr) # for map_dbl
name_list = c("ABC_D1", "ABC_D2", "ABC_D3",
              "ABC_E1", "ABC_E2", "ABC_E3",
              "ABC_F1", "ABC_F2", "ABC_F3")
df1 = data.frame(C1 = name_list)
df1$C2 <- map_dbl(paste0(df1$C1, '$estimate'), function(x) eval(parse(text = x)))
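As a side note, a safer alternative to eval(parse(...)) is to look the objects up by name with mget. A sketch, assuming the ABC_* objects all live in the global environment; it also fills the third column with the p-values in one go:
objs <- mget(as.character(df1$C1), envir = globalenv())
df1$C2 <- vapply(objs, function(x) unname(x$estimate), numeric(1))
df1$C3 <- vapply(objs, function(x) x$p.value, numeric(1))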
I have a log file with lines of up to about 1200 characters. What I want to do is read it in and then extract certain portions into new columns; I want to keep the rows that contain the text "[DF_API: input string]".
When I read it and then filter down to the rows I am interested in, it almost seems like I am losing data. I tried this using dplyr's filter and using standard grep, with the same result.
Not sure why this is the case. I'd appreciate your help. The code and the data are at the following link.
Satish
Code is given below
library(dplyr)
library(stringr) # needed for str_detect below
setwd("C:/Users/satis/Documents/VF/df_issue_dec01")
sec1 <- read.delim(file="secondary1_aa_small.log")
head(sec1)
names(sec1) <- c("V1")
sec1_test <- filter(sec1,str_detect(V1,"DF_API: input string")==TRUE)
head(sec1_test)
sec1_test2 = sec1[grep("DF_API: input string",sec1$V1, perl = TRUE),]
head(sec1_test2)
write.csv(sec1_test, file = "test_out.txt", row.names = F, quote = F)
write.csv(sec1_test2, file = "test2_out.txt", row.names = F, quote = F)
Data (and code) is given at the link below. Sorry, I should have used dput.
https://spaces.hightail.com/space/arJlYkgIev
Try the code below, which gives you a data frame of the lines from your file that match the condition. (readLines avoids read.delim's header and quoting rules, which are the likely cause of the lost lines: read.delim consumes the first line as a header by default, and unbalanced quote characters in a log line can make it swallow several lines as one field.)
#to read your file
sec1 <- readLines("secondary1_aa_small.log")
#framing a dataframe by extracting required lines from above file
new_sec1 <- data.frame(grep("DF_API: input string", sec1, value = T))
names(new_sec1) <- c("V1")
Edit: Simple way to split the above column into multiple columns
#extracting substring in between < & >
new_sec1$V1 <- gsub(".*[<\t]([^>]+)[>].*", "\\1", new_sec1$V1)
#replacing comma(,) with a white space
new_sec1$V1 <- gsub("[,]+", " ", new_sec1$V1)
#splitting into separate columns
new_sec1 <- strsplit(new_sec1$V1, " ")
new_sec1 <- lapply(new_sec1, function(x) x[x != ""]) # drop empty strings
new_sec1 <- do.call(rbind, new_sec1)
new_sec1 <- data.frame(new_sec1)
Change the column names as needed for your analysis.
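If you prefer to avoid the manual strsplit/rbind dance, the splitting step can also be done by read.table, which splits on runs of whitespace and pads short rows with NA. A sketch, assuming new_sec1$V1 has already been cleaned as above:
new_sec1 <- read.table(text = new_sec1$V1, fill = TRUE,
                       stringsAsFactors = FALSE)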
As an example, I have the following XML code:
tt = '<Nummeraanduiding>
<identificatie>0010200000114849</identificatie>
<aanduidingRecordInactief>N</aanduidingRecordInactief>
<aanduidingRecordCorrectie>0</aanduidingRecordCorrectie>
<huisnummer>13</huisnummer>
<officieel>N</officieel>
<postcode>9904PC</postcode>
<tijdvakgeldigheid>
<begindatumTijdvakGeldigheid>2010051100000000</begindatumTijdvakGeldigheid>
</tijdvakgeldigheid>
<inOnderzoek>N</inOnderzoek>
<typeAdresseerbaarObject>Verblijfsobject</typeAdresseerbaarObject>
<bron>
<documentdatum>20100511</documentdatum>
<documentnummer>2010/NR002F</documentnummer>
</bron>
<nummeraanduidingStatus>Naamgeving uitgegeven</nummeraanduidingStatus>
<gerelateerdeOpenbareRuimte>
<identificatie>0010300000000444</identificatie>
</gerelateerdeOpenbareRuimte>
</Nummeraanduiding> '
The goal is to convert this node (Nummeraanduiding) to a data.table (or a data.frame is also fine). One challenge is that I have a lot of these Nummeraanduiding nodes (millions of them).
The following code is able to process the data:
library(XML)
# This parses the doc...
doc = xmlParse(tt)
# Solution (1) - this is the most obvious solution..
XML::xmlToDataFrame(doc)
# Solution (2) - apparently converting to a list is also possible..
unlist(xmlToList(doc))
# Solution (3) - My own solution
data.frame(as.list(unlist(xmlToList(doc))))
Not all of these solutions produce the desired result; in the end, only Solution (3) satisfies my needs:
It is in a data.frame/data.table format
It contains all the child-child-nodes and has distinct names for each column
It does not 'merge' the information of child-child-nodes
However, running this piece of code over all my data becomes quite slow: it took more than 8 hours to complete for a file containing 2,290,000 'Nummeraanduiding' nodes.
Do you know of any way to speed up this process? Can my method be improved? Am I missing some useful function?
Given that each field is already on a separate line, just grep the field lines out, read what is left using read.table, and convert from long to wide using tapply to produce the resulting matrix (which can be converted to a data frame or data.table if desired). Note that in read.table we bypass quote, comment, and class processing. Finally, test it out to see whether it is faster. No packages are used.
nms <- c("identificatie", "aanduidingRecordInactief", "aanduidingRecordCorrectie",
"huisnummer", "officieel", "postcode", "tijdvakgeldigheid.begindatumTijdvakGeldigheid",
"inOnderzoek", "typeAdresseerbaarObject", "bron.documentdatum",
"bron.documentnummer", "nummeraanduidingStatus",
"gerelateerdeOpenbareRuimte.identificatie")
rx <- paste(sub(".*\\.", "", nms), collapse = "|") # match the bare tag names as they appear in the XML
g <- chartr("<", ">", grep(rx, readLines(textConnection(tt)), value = TRUE))
long <- read.table(text = g, sep = ">", quote = "", comment.char = "",
                   colClasses = "character")[2:3]
names(long) <- c("field", "value")
# <identificatie> occurs twice per record (top level and nested), so label the
# fields by position, assuming every record carries all fields in document order
long$field <- factor(rep(nms, length.out = nrow(long)), levels = nms)
long$recno <- rep(seq_len(nrow(long) %/% length(nms)), each = length(nms))
with(long, tapply(value, list(recno, field), c))
If all records have exactly the same set of fields, such as those listed in nms, then the last line could be replaced with this (which is likely faster):
matrix(long$value, ncol = length(nms), byrow = TRUE, dimnames = list(NULL, nms))
Another alternative to the tapply line would be to use reshape from base R or to use dcast from the reshape2 package.
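If you want to stay with a real XML parser, the xml2 package is typically much faster than the XML package for this kind of extraction, and it can reproduce the dotted names of Solution (3). A sketch for a single node; for a big file you would run this over xml_find_all(doc, "//Nummeraanduiding") and rbind the results. Whether it beats the grep approach above is worth benchmarking:
library(xml2)
doc <- read_xml(tt)
leaves <- xml_find_all(doc, ".//*[not(*)]") # leaf elements only
vals <- xml_text(leaves)
# qualify nested leaves with their parent's name, e.g. bron.documentnummer
nms2 <- vapply(leaves, function(n) {
  p <- xml_name(xml_parent(n))
  if (p == "Nummeraanduiding") xml_name(n) else paste(p, xml_name(n), sep = ".")
}, character(1))
as.data.frame(as.list(setNames(vals, nms2)), stringsAsFactors = FALSE)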
I am using R to do some data pre-processing, and here is the problem I am faced with: I read the data in using read.csv(filename, header = TRUE), and the spaces in variable names became ".". For example, a variable named Full Code became Full.Code in the generated data frame. After the processing, I use write.xlsx(filename) to export the results, but the variable names are still the changed ones. How can I address this?
Besides, in the output .xlsx file, the first column becomes row indices (i.e., 1 to N), which is not what I am expecting.
If you set check.names = FALSE in read.csv when you read the data in, then the names will not be changed and you will not need to edit them before writing the data back out. This of course means that you would need to quote the column names (backticks in some cases) or refer to the columns by position rather than by name while editing.
To get spaces back in the names, do this (right before you export - R does let you have spaces in variable names, but it's a pain):
# A simple regular expression to replace dots with spaces
# This might have unintended consequences, so be sure to check the results
names(yourdata) <- gsub(x = names(yourdata),
pattern = "\\.",
replacement = " ")
To drop the first-column index, just add row.names = FALSE to your write.xlsx(). That's a common argument for functions that write out data in tabular format (write.csv() has it, too).
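For instance, a minimal sketch assuming you are using the xlsx package's write.xlsx() (openxlsx spells the same argument rowNames):
library(xlsx)
write.xlsx(yourdata, file = "results.xlsx", row.names = FALSE)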
Here's a function (sorry, I know it could be refactored) that makes nice column names even if there are multiple consecutive dots and trailing dots:
makeColNamesUserFriendly <- function(ds) {
# FIXME: Repetitive.
# Convert any number of consecutive dots to a single space.
names(ds) <- gsub(x = names(ds),
pattern = "(\\.)+",
replacement = " ")
# Drop the trailing spaces.
names(ds) <- gsub(x = names(ds),
pattern = "( )+$",
replacement = "")
ds
}
Example usage:
ds <- makeColNamesUserFriendly(ds)
Just to add to the answers already provided, here is another way of replacing the "." (or any other kind of punctuation) in column names, using a regex with the stringr package:
require("stringr")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
For example try:
data <- data.frame(variable.x = 1:10, variable.y = 21:30, variable.z = "const")
colnames(data) <- str_replace_all(colnames(data), "[:punct:]", " ")
and
colnames(data)
will give you
[1] "variable x" "variable y" "variable z"