Converting a simple XML node into an R dataframe gets empty values - r

I have a very simple XML file that I want to export to a dataframe in R.
<root>
<source>
<sourceId value="8556"/>
</source>
<content>
<DESCRIPTION value="0"/>
<SORTED value="290"/>
<ANNULATION value="34"/>
<RECORDING value="5665"/>
<TOLOCK value=""/>
<FUTURE value="categorical"/>
</content>
</root>
I retrieve the node I need this way:
library(XML)
xmlDoc <- xmlParse("path-to-file", useInternalNodes=TRUE)
df <- xmlToDataFrame(getNodeSet(xmlDoc,"//content"))
but the dataframe has only columns with no values at all, so I guess I went wrong at some step.
> df
DESCRIPTION SORTED ANNULATION RECORDING TOLOCK FUTURE
1
> str(df)
'data.frame': 1 obs. of 6 variables:
$ DESCRIPTION: chr ""
$ SORTED : chr ""
$ ANNULATION : chr ""
$ RECORDING : chr ""
$ TOLOCK : chr ""
$ FUTURE : chr ""

Usually, XML processing is very dependent on the file, so you have to struggle with it; there is no silver bullet.
In your case the catch is that xmlToDataFrame reads the text content of each child node, while your data lives in the value attributes, which is why every column comes back empty. Instead, iterate through the tag names and values this way, assuming you want them in one row (not very pretty, I must say):
library(xml2)
library(magrittr)

doc <- read_xml("my.xml")
content <- xml_find_first(doc, ".//content")
values <- xml_children(content) %>% xml_attr("value")
names <- xml_name(xml_children(content))
df <- data.frame(matrix(ncol = length(names), nrow = 0))
df <- rbind(df, values)
colnames(df) <- names
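If you would rather stay with the XML package the question already uses, the same attribute extraction can be sketched as follows (a sketch against the sample XML above, not tested against your real file):

```r
library(XML)

doc <- xmlParse("my.xml", useInternalNodes = TRUE)
content <- getNodeSet(doc, "//content")[[1]]

# xmlGetAttr pulls the "value" attribute from each child tag;
# the list returned by xmlChildren is named by tag, so the
# names carry through to the resulting columns
vals <- sapply(xmlChildren(content), xmlGetAttr, name = "value")
df <- as.data.frame(t(vals), stringsAsFactors = FALSE)
```

Transposing the named vector gives a one-row data frame whose column names are the tag names (DESCRIPTION, SORTED, and so on).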

Related

How to read file with irregularly nested quotations?

I have a file with irregular quotes like the following:
"INDICATOR,""CTY_CODE"",""MGN_CODE"",""EVENT_NR"",""EVENT_NR_CR"",""START_DATE"",""PEAK_DATE"",""END_DATE"",""MAX_EXT_ON"",""DURATION"",""SEVERITY"",""INTENSITY"",""AVERAGE_AREA"",""WIDEST_AREA_PERC"",""SCORE"",""GRP_ID"""
"Spi-3,""AFG"","""",1,1,""1952-10-01"",""1952-11-01"",""1953-06-01"",""1952-11-01"",9,6.98,0.78,19.75,44.09,5,1"
It seems irregular because the first column is wrapped in just one set of quotes, whereas every subsequent column is wrapped in doubled quotes. I'd like to read it so that every column is imported without quotes (neither in the header, nor the data).
What I've tried is the following:
# All sorts of tidyverse imports
tib <- readr::read_csv("file.csv")
And I also tried the suggestions offered here:
# Base R import
DF0 <- read.table("file.csv", as.is = TRUE)
DF <- read.csv(text = DF0[[1]])
# Data table import
DT0 <- fread("file.csv", header =F)
DT <- fread(paste(DT0[[1]], collapse = "\n"))
But even when it imports the file in the latter two cases, the variable names and some of the elements are wrapped in quotation marks.
I used data.table::fread with the quote="" option (which is "as is").
Then I cleaned the names and data by eliminating all the quotes.
The dates could be converted too, but I didn't do that.
library(data.table)
library(magrittr)
DT0 <- fread('file.csv', quote = "")
DT0 %>% setnames(names(.), gsub('"', '', names(.)))
string_cols <- which(sapply(DT0, class) == 'character')
DT0[, (string_cols) := lapply(.SD, function(x) gsub('\\"', '', x)),
.SDcols = string_cols]
str(DT0)
Classes ‘data.table’ and 'data.frame': 1 obs. of 16 variables:
$ INDICATOR : chr "Spi-3"
$ CTY_CODE : chr "AFG"
$ MGN_CODE : chr ""
$ EVENT_NR : int 1
$ EVENT_NR_CR : int 1
$ START_DATE : chr "1952-10-01"
$ PEAK_DATE : chr "1952-11-01"
$ END_DATE : chr "1953-06-01"
$ MAX_EXT_ON : chr "1952-11-01"
$ DURATION : int 9
$ SEVERITY : num 6.98
$ INTENSITY : num 0.78
$ AVERAGE_AREA : num 19.8
$ WIDEST_AREA_PERC: num 44.1
$ SCORE : int 5
$ GRP_ID : chr "1"
- attr(*, ".internal.selfref")=<externalptr>
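An alternative sketch, assuming a data.table version of at least 1.11 (which added the text argument to fread): strip every double quote from the raw lines before parsing at all.

```r
library(data.table)

raw <- readLines("file.csv")
# delete all double quotes, then let fread re-parse the cleaned text
DT <- fread(text = gsub('"', '', raw, fixed = TRUE))
```

This avoids the separate name/column cleanup passes, at the cost of also removing any quotes that were protecting embedded commas, so it is only safe when the fields themselves contain none.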

How to import multiple files into a list while keeping their names?

I am reading several SAS files from a server and loading them all into a list in R. I removed one of the datasets because I didn't need it in the final analysis (dataset #31):
mylist<-list.files("path" , pattern = ".sas7bdat")
mylist <- mylist[- 31]
Then I used lapply to read all the datasets in the list (mylist) at once:
read.all <- lapply(mylist, read_sas)
The code works well. However, when I run View(read.all) to see the datasets, I only see a number (e.g., 1, 2, etc.) instead of the names of the original datasets.
Does anyone know how I can keep the names of the datasets in the final list?
Also, can anyone tell me how I can work with this list in R?
Is it an object? Can I read one of the datasets in the list? And how can I join some of the datasets in the list?
Use basename and tools::file_path_sans_ext:
filenames <- head(list.files("~/StackOverflow", pattern = "^[^#].*\\.R", recursive = TRUE, full.names = TRUE))
filenames
# [1] "C:\\Users\\r2/StackOverflow/1000343/61469332.R" "C:\\Users\\r2/StackOverflow/10087004/61857346.R"
# [3] "C:\\Users\\r2/StackOverflow/10097832/60589834.R" "C:\\Users\\r2/StackOverflow/10214507/60837843.R"
# [5] "C:\\Users\\r2/StackOverflow/10215127/61720149.R" "C:\\Users\\r2/StackOverflow/10226369/60778116.R"
basename(filenames)
# [1] "61469332.R" "61857346.R" "60589834.R" "60837843.R" "61720149.R" "60778116.R"
tools::file_path_sans_ext(basename(filenames))
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
somedat <- setNames(lapply(filenames, readLines, n=2),
tools::file_path_sans_ext(basename(filenames)))
names(somedat)
# [1] "61469332" "61857346" "60589834" "60837843" "61720149" "60778116"
str(somedat)
# List of 6
# $ 61469332: chr [1:2] "# https://stackoverflow.com/questions/61469332/determine-function-name-within-that-function/61469380" ""
# $ 61857346: chr [1:2] "# https://stackoverflow.com/questions/61857346/how-to-use-apply-family-instead-of-nested-for-loop-for-my-problem?noredirect=1" ""
# $ 60589834: chr [1:2] "# https://stackoverflow.com/questions/60589834/add-columns-to-data-frame-based-on-function-argument" ""
# $ 60837843: chr [1:2] "# https://stackoverflow.com/questions/60837843/how-to-remove-all-parentheses-from-a-vector-of-string-except-whe"| __truncated__ ""
# $ 61720149: chr [1:2] "# https://stackoverflow.com/questions/61720149/extracting-the-original-data-based-on-filtering-criteria" ""
# $ 60778116: chr [1:2] "# https://stackoverflow.com/questions/60778116/how-to-shift-data-by-a-factor-of-two-months-in-r" ""
Each "name" is the character representation of (in this case) the stackoverflow question number, with the ".R" removed. (And since I typically include the normal URL as the first line then an empty line in the files I use to test/play and answer SO questions, all of these files look similar at the top two lines.)
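Applied to the SAS files in the question, the same pattern might look like this (the path and the dataset names in the comments are placeholders, and read_sas comes from the haven package):

```r
library(haven)  # provides read_sas

mylist <- list.files("path", pattern = "\\.sas7bdat$", full.names = TRUE)
read.all <- setNames(lapply(mylist, read_sas),
                     tools::file_path_sans_ext(basename(mylist)))

# read.all is an ordinary named list: pull one dataset by name...
# one <- read.all[["some_dataset"]]
# ...or stack datasets that share the same columns into one data frame
# combined <- do.call(rbind, read.all[c("dataset_a", "dataset_b")])
```

That also answers the follow-up questions: the list is a normal R object, single elements are extracted with [[ ]], and datasets with matching columns can be combined with rbind (or merged on keys with merge).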

Change datatype of multiple columns in dataframe in R

I have the following dataframe:
str(dat2)
data.frame: 29081 obs. of 105 variables:
$ id: int 20 34 46 109 158....
$ reddit_id: chr "t1_cnas90f" "t1_cnas90t" "t1_cnas90g"....
$ subreddit_id: chr "t5_cnas90f" "t5_cnas90t" "t5_cnas90g"....
$ link_id: chr "t3_c2qy171" "t3_c2qy172" "t3_c2qy17f"....
$ created_utc: chr "2015-01-01" "2015-01-01" "2015-01-01"....
$ ups: int 3 1 0 1 2....
...
How can I change the datatype of reddit_id, subreddit_id and link_id from character to factor? I know how to do it column by column, but as this is tedious work, I am searching for a faster way to do it.
I have tried the following, without success:
dat2[2:4] <- data.frame(lapply(dat2[2:4], factor))
I took this from another approach, but it ends up giving me the error message: invalid 'length' argument.
Another approach was to do it this way:
dat2 <- as.factor(data.frame(dat2$reddit_id, dat2$subreddit_id, dat2$link_id))
Result: Error in sort.list(y) : 'x' must be atomic for 'sort.list'
After reading the error i also tried it the other way around:
dat2 <- data.frame(as.factor(dat2$reddit_id, dat2$subreddit_id, dat2$link_id))
Also without success.
If some information is missing, I am sorry; I am a newbie to R and Stack Overflow. Thank you for your help!
Try with:
library("tidyverse")
data %>%
  mutate_at(.vars = vars(reddit_id, subreddit_id, link_id),
            .funs = factor)
To take advantage of partial matching, use
data %>%
  mutate_at(.vars = vars(contains("reddit"), link_id),
            .funs = factor)
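For what it's worth, the first attempt in the question was close: dropping the extra data.frame() wrapper makes it work in base R, and in dplyr 1.0+ across() supersedes mutate_at(). A sketch of both:

```r
# base R: lapply returns a list, which assigns straight back into the columns
dat2[2:4] <- lapply(dat2[2:4], factor)

# dplyr >= 1.0: across() replaces the superseded mutate_at()
library(dplyr)
dat2 <- dat2 %>%
  mutate(across(c(reddit_id, subreddit_id, link_id), factor))
```

The original error came from wrapping the lapply result in data.frame() before assignment; the plain list is what [<- expects.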

R read.csv "More columns than column names" error

I have a problem when importing .csv file into R. With my code:
t <- read.csv("C:\\N0_07312014.CSV", na.string=c("","null","NaN","X"),
header=T, stringsAsFactors=FALSE,check.names=F)
R reports an error and does not do what I want:
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
more columns than column names
I guess the problem is that my data is not well formatted. I only need the data in [, 1:32]; everything else should be deleted.
Data can be downloaded from:
https://drive.google.com/file/d/0B86_a8ltyoL3VXJYM3NVdmNPMUU/edit?usp=sharing
Thanks so much!
Open the .csv as a text file (for example, use TextEdit on a Mac) and check whether the columns are actually being separated with commas.
CSV stands for "comma-separated values", but for some reason when Excel saves my CSVs it uses semicolons instead.
When opening your csv use:
read.csv("file_name.csv", sep=";")
Semicolon is just an example, but as someone else suggested, don't assume that just because your CSV looks good in Excel it is actually well formed.
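Before guessing at separators, base R's count.fields can show how many fields each line actually contains, which pinpoints "more columns than column names" problems (the filename below is the one from the question):

```r
# number of comma-separated fields on every line of the file
n <- count.fields("N0_07312014.CSV", sep = ",")
table(n)                 # distribution of field counts across lines
head(which(n != n[1]))   # first lines that disagree with the header row
```

If table(n) shows more than one field count, the file mixes row widths and needs pre-processing of the kind shown below.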
That's one wonky CSV file: multiple headers tossed about (try pasting it into CSV Fingerprint to see what I mean).
Since I don't know the data, it's impossible to be sure the following produces accurate results for you, but it involves using readLines and other R functions to pre-process the text:
# use readLines to get the data
dat <- readLines("N0_07312014.CSV")
# i had to do this to fix grep errors
Sys.setlocale('LC_ALL','C')
# filter out the repeating, and wonky headers
dat_2 <- grep("Node Name,RTC_date", dat, invert=TRUE, value=TRUE)
# turn that vector into a text connection for read.csv
dat_3 <- read.csv(textConnection(paste0(dat_2, collapse="\n")),
header=FALSE, stringsAsFactors=FALSE)
str(dat_3)
## 'data.frame': 308 obs. of 37 variables:
## $ V1 : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ V2 : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ V3 : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ V4 : chr "" "" "" "" ...
## .. more
## $ V36: chr "" "" "" "" ...
## $ V37: chr "0" "0" "0" "0" ...
# grab the headers
headers <- strsplit(dat[1], ",")[[1]]
# how many of them are there?
length(headers)
## [1] 32
# limit it to the 32 columns you want (which matches the header count)
dat_4 <- dat_3[,1:32]
# and add the headers
colnames(dat_4) <- headers
str(dat_4)
## 'data.frame': 308 obs. of 32 variables:
## $ Node Name : chr "Node 0" "Node 0" "Node 0" "Node 0" ...
## $ RTC_date : chr "07/31/2014" "07/31/2014" "07/31/2014" "07/31/2014" ...
## $ RTC_time : chr "08:58:18" "08:59:22" "08:59:37" "09:00:06" ...
## $ N1 Bat (VDC) : chr "" "" "" "" ...
## $ N1 Shinyei (ug/m3): chr "" "" "0.23" "null" ...
## $ N1 CC (ppb) : chr "" "" "null" "null" ...
## $ N1 Aeroq (ppm) : chr "" "" "null" "null" ...
## ... continues
If you only need the first 32 columns and you know how many columns the file has in total, you can set the classes of the other columns to NULL.
read.csv("C:\\N0_07312014.CSV", na.string=c("","null","NaN","X"),
header=T, stringsAsFactors=FALSE,
colClasses=c(rep("character",32),rep("NULL",10)))
If you do not want to code up each colClass but you like the guesses read.csv makes, just save that csv and open it again.
Alternatively, you can skip the header, name the columns yourself, and remove the misbehaving rows.
A<-data.frame(read.csv("N0_07312014.CSV",
header=F,stringsAsFactors=FALSE,
colClasses=c(rep("character",32),rep("NULL",5)),
na.string=c("","null","NaN","X")))
Yournames<-as.character(A[1,])
names(A)<-Yournames
yourdata<-unique(A)[-1,]
The code above assumes you do not want any duplicate rows. Alternatively, you can remove rows whose first entry equals the first column name, but I'll leave that to you.
try read.table() instead of read.csv()
I was also facing the same issue. Now solved.
Just use header = FALSE
read.csv("data.csv", header = FALSE) -> mydata
I had the same problem. I opened my data in a text editor and the double values were separated by semicolons; you should replace them with a period.
I was having this error that was caused by multiple rows of meta data at the top of the file. I was able to use read.csv by doing skip= and skipping those rows.
data <- read.csv('/blah.csv',skip=3)
For me, the solution was using read.csv2 instead of read.csv, since read.csv2 expects semicolon-separated files:
read.csv2("file_name.csv")
read.csv("file_name.csv", header=F)
Setting header to FALSE will do the job perfectly for you.

Replace row in data.frame

I have a dataframe which looks like this:
'data.frame': 3036 obs. of 751 variables:
$ X : chr "01.01.2002" "02.01.2002" "03.01.2002" "04.01.2002" ...
$ A: chr "na" "na" "na" "na" ...
$ B: chr "na" "1,827437365" "0,833922973" "-0,838923572" ...
$ C: chr "na" "1,825300613" "0,813299479" "-0,866639008" ...
$ D: chr "na" "1,820482187" "0,821374034" "-0,875963104" ...
...
I have converted the X column into Date format:
dates <- as.Date(dataFrame$X, '%d.%m.%Y')
Now I want to replace that column. The thing is, I cannot simply build a new dataframe by hand, because after D there are over 1000 more columns...
What would be a possible way to do that easily?
I think what you want is simply:
dataFrame$X <- dates
if what you want is to replace column X with dates. If instead you want to remove column X, simply do the following:
dataFrame$X <- NULL
(edited with the more concise removal method provided by user #shujaa)
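If you also want the remaining character columns (with their comma decimal marks, like "1,827437365") as numbers, here is a hedged base R sketch; note that the "na" strings will become NA with a coercion warning:

```r
num_cols <- setdiff(names(dataFrame), "X")
# swap the decimal comma for a period, then coerce to numeric
dataFrame[num_cols] <- lapply(dataFrame[num_cols], function(x)
  as.numeric(sub(",", ".", x, fixed = TRUE)))
```

The lapply result is a list, which assigns straight back into the selected columns without creating a new dataframe.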