Closed. This question needs details or clarity. It is not currently accepting answers.
Closed last year.
I want to convert an XML file to a data frame.
I'm aware of XML::xmlToDataFrame, but it gives an error in my case.
The XML can be found here:
https://api.data.gov.hk/v1/historical-archive/get-file?url=https%3A%2F%2Fresource.data.one.gov.hk%2Ftd%2Ftraffic-detectors%2FrawSpeedVol-all.xml&time=20211216-0513
Thanks for all answers!
Because your XML file contains several levels of nested children, XML::xmlToDataFrame throws an error.
I approached the problem with a naive method, but it works.
Here's what I did:
The following code creates a data frame from the child tags of each <lane> element.
library(xml2) # the XML package is not needed for this approach
pg <- read_xml("https://s3-ap-southeast-1.amazonaws.com/historical-resource-archive/2021/12/16/https%253A%252F%252Fresource.data.one.gov.hk%252Ftd%252Ftraffic-detectors%252FrawSpeedVol-all.xml/0513")
records <- xml_find_all(pg, "//lane")
nodenames<-xml_name(xml_children(records))
nodevalues<-trimws(xml_text(xml_children(records)))
lane_id <- nodevalues[seq(1, length(nodevalues), 6)]
speed <- nodevalues[seq(2, length(nodevalues), 6)]
occupancy <- nodevalues[seq(3, length(nodevalues), 6)]
volume <- nodevalues[seq(4, length(nodevalues), 6)]
s.d. <- nodevalues[seq(5, length(nodevalues), 6)]
valid <- nodevalues[seq(6, length(nodevalues), 6)]
df <- data.frame(lane_id, speed, occupancy, volume, s.d., valid)
head(df)
The df looks like this:
lane_id speed occupancy volume s.d. valid
1 Fast Lane 70 0 0 0 Y
2 Middle Lane 76 6 3 11.1 Y
3 Slow Lane 70 6 0 0 Y
4 Fast Lane 82 1 1 0 Y
5 Middle Lane 63 3 1 0 Y
6 Slow Lane 79 2 1 0 Y
If you want to extract the data of <detectors>, you can use the following code:
################ Extract Detector Data #########
records2 <- xml_find_all(pg, "//detector")
vals2 <- trimws(xml_text(records2))
nodenames2 <-xml_name(xml_children(records2))
nodevalues2 <-trimws(xml_text(xml_children(records2)))
detector_id <- nodevalues2[seq(1, length(nodevalues2), 3)]
direction <- nodevalues2[seq(2, length(nodevalues2), 3)]
lanes <- nodevalues2[seq(3, length(nodevalues2), 3)]
df2 <- data.frame(detector_id, direction, lanes)
head(df2)
The df2 looks like this:
detector_id direction lanes
1 AID01101 South East Fast Lane70000YMiddle Lane766311.1YSlow Lane70600Y
2 AID01102 North East Fast Lane82110YMiddle Lane63310YSlow Lane79210Y
3 AID01103 South East Fast Lane50000YMiddle Lane65210YSlow Lane192310Y
4 AID01104 North East Fast Lane50000YSlow Lane63110Y
5 AID01105 North East Fast Lane50100YSlow Lane53410Y
6 AID01106 South East Fast Lane50300YSlow Lane56510Y
As you can see, the lanes column is not as clean as you would like, because <lane> is a grandchild tag of <detector> and its text gets concatenated.
You could, however, combine df and df2 into a new data frame in whatever shape you need.
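One way to build that combined data frame directly is to loop over the <detector> nodes and look up each field by tag name instead of by position. The sketch below runs against a small inline sample, not the real feed; the tag names (detector_id, direction, lane_id, speed, valid) are assumptions inferred from the column names above and may need adjusting.

```r
library(xml2)

# Hypothetical miniature of the feed; the real tag names may differ.
doc <- read_xml('<feed>
  <detector>
    <detector_id>AID01101</detector_id>
    <direction>South East</direction>
    <lanes>
      <lane><lane_id>Fast Lane</lane_id><speed>70</speed><valid>Y</valid></lane>
      <lane><lane_id>Slow Lane</lane_id><speed>76</speed><valid>Y</valid></lane>
    </lanes>
  </detector>
  <detector>
    <detector_id>AID01102</detector_id>
    <direction>North East</direction>
    <lanes>
      <lane><lane_id>Fast Lane</lane_id><speed>82</speed><valid>Y</valid></lane>
    </lanes>
  </detector>
</feed>')

detectors <- xml_find_all(doc, "//detector")

# Build one small data frame per detector, with one row per lane.
# The detector-level fields are recycled across that detector's lanes.
rows <- lapply(seq_along(detectors), function(i) {
  det   <- detectors[[i]]
  lanes <- xml_find_all(det, ".//lane")
  data.frame(
    detector_id = xml_text(xml_find_first(det, "./detector_id")),
    direction   = xml_text(xml_find_first(det, "./direction")),
    lane_id     = xml_text(xml_find_first(lanes, "./lane_id")),
    speed       = as.numeric(xml_text(xml_find_first(lanes, "./speed"))),
    valid       = xml_text(xml_find_first(lanes, "./valid")),
    stringsAsFactors = FALSE
  )
})

lane_df <- do.call(rbind, rows)
lane_df
```

Looking fields up by name means a lane with a missing child yields NA for that field instead of shifting every later value, which is the main risk of the seq(…, 6) indexing. A detector with no lanes at all would need separate handling.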
I have a data set that looks like this
lat_deg lat_min long_deg long_min
site 1 44 4.971 80 27.934
site 2
site 3
site 4
site 5
site <- c(1,2,3)
lat_deg <- c(44,44,44)
lat_min <- c(4.971, 4.977, 4.986)
long_deg <- c(80,80,80)
long_min <- c(27.934, 27.977, 27.986)
df <- data.frame(site, lat_deg, lat_min, long_deg, long_min)
How do I convert this into decimal degrees only? All my latitudes are N and all my longitudes are W, so I'm not too worried, except that the final sign should be correct. Additionally, will I be able to calculate altitude from here?
Note: All other questions on SO focus on DMS to DD. This question has not been asked before.
Based on the data given, here is an approach using dplyr:
library(dplyr)
df %>%
mutate(lat = lat_deg + lat_min/60,
long = long_deg + long_min/60)
returns
site lat_deg lat_min long_deg long_min lat long
1 1 44 4.971 80 27.934 44.08285 80.46557
2 2 44 4.977 80 27.977 44.08295 80.46628
3 3 44 4.986 80 27.986 44.08310 80.46643
or simply
df$lat <- df$lat_deg + df$lat_min/60
df$long <- df$long_deg + df$long_min/60
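The question also asks for the correct sign, and by the usual convention west longitudes and south latitudes are negative. A minimal sketch of a helper that applies the sign, assuming the hemisphere is supplied separately (ddm_to_dd and its hemisphere argument are illustrative names, not part of any package):

```r
# Convert degrees + decimal minutes to signed decimal degrees.
# hemisphere is one of "N", "S", "E", "W"; S and W yield negative values.
ddm_to_dd <- function(deg, min, hemisphere) {
  sgn <- ifelse(hemisphere %in% c("S", "W"), -1, 1)
  sgn * (abs(deg) + min / 60)
}

df <- data.frame(
  site     = c(1, 2, 3),
  lat_deg  = c(44, 44, 44),
  lat_min  = c(4.971, 4.977, 4.986),
  long_deg = c(80, 80, 80),
  long_min = c(27.934, 27.977, 27.986)
)

df$lat  <- ddm_to_dd(df$lat_deg,  df$lat_min,  "N")  # positive
df$long <- ddm_to_dd(df$long_deg, df$long_min, "W")  # negative
df
```

Altitude cannot be computed from latitude and longitude alone; it has to be looked up in a separate elevation data set.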
I'm new to R, so please go easy on me... I have some longitudinal data that looks like
Basically, I'm trying to find a way to get a table with a) the number of unique cases that have all complete data and b) the number of unique cases that have at least one incomplete or missing data. The end results would ideally be
df<- df %>% group_by(Location)
df1<- df %>% group_by(any(Completion_status=='Incomplete' | 'Missing'))
I'm not sure exactly what you want, because your request and the desired output seem inconsistent, but let's try. It looks like you need a kind of frequency table, which you can build with base R. Some data similar to yours is at the bottom of this answer.
# You have two cases, the Complete, and the other, so here a new column about it:
data$case <- ifelse(data$Completion_status =='Complete','Complete', 'MorIn')
# now a frequency table about them: if you want a data.frame, here we go
result <- as.data.frame.matrix(table(data$Location,data$case))
# now the location as a new column rather than the rownames
result$Location <- rownames(result)
# and lastly a data.frame with the final results: note that you can change the names
# of the columns but if you want spaces maybe a tibble is better
result <- data.frame(Location = result$Location,
`Number.complete` = result$Complete,
`Number.incomplete.missing` = result$MorIn)
result
Location Number.complete Number.incomplete.missing
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
Or, if you prefer a dplyr chain:
library(dplyr)
data %>%
mutate(case = ifelse(Completion_status == 'Complete', 'Complete', 'MorIn')) %>%
do( as.data.frame.matrix(table(.$Location,.$case))) %>%
mutate(Location = rownames(.)) %>%
select(3,1,2) %>%
`colnames<-`(c("Location","Number of complete ", "Number of incomplete or"))
Location Number of complete Number of incomplete or
1 London 0 1
2 Los Angeles 0 1
3 Paris 3 1
4 Phoenix 0 2
5 Toronto 1 1
With data:
# here is your data (next time, try to include it in a usable form in the question)
data <- data.frame( ID = c("A1","A1","A2","A2","B1","C1","C2","D1","D2","E1"),
Location = c('Paris','Paris','Paris','Paris','London','Toronto','Toronto','Phoenix','Phoenix','Los Angeles'),
Completion_status = c('Complete','Complete','Incomplete','Complete','Incomplete','Missing',
'Complete','Incomplete','Incomplete','Missing'))
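The tables above count rows. If the goal is to count unique cases, as the question describes, one possible reading is to first reduce each ID to a single flag (all records complete, or not) and then tabulate by location. A sketch in base R, assuming ID identifies a case and each ID belongs to exactly one Location; the column labels are illustrative:

```r
data <- data.frame(
  ID = c("A1", "A1", "A2", "A2", "B1", "C1", "C2", "D1", "D2", "E1"),
  Location = c("Paris", "Paris", "Paris", "Paris", "London", "Toronto",
               "Toronto", "Phoenix", "Phoenix", "Los Angeles"),
  Completion_status = c("Complete", "Complete", "Incomplete", "Complete",
                        "Incomplete", "Missing", "Complete", "Incomplete",
                        "Incomplete", "Missing"),
  stringsAsFactors = FALSE
)

# TRUE for a row whose record is complete
data$is_complete <- data$Completion_status == "Complete"

# One row per ID: TRUE only if every record of that ID is complete
per_id <- aggregate(is_complete ~ ID + Location, data = data, FUN = all)

# Count unique IDs per location in each group
counts <- table(
  per_id$Location,
  ifelse(per_id$is_complete, "all_complete", "incomplete_or_missing")
)
counts
```

With this data, Paris has one case with all records complete (A1) and one with at least one incomplete record (A2), whereas the row-based table reports 3 and 1.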
Sorry, I've tried my best but couldn't find the answer. As a beginner, I'm not sure I can state the question clearly. Thanks in advance.
So I have a dataframe with data about consumption with 24000 rows.
In this dataframe, there is a series of variable about the number of objects bought within the last two months :
NumberOfCoat, NumberOfShirt, NumberOfPants, NumberOfShoes...
And there is a variable "profession" registered by number.
So now the data looks like this
profession NumberOfCoat NumberOfShirt NumberOfShoes
individu1 1 1 1 1
individu2 3 2 4 1
individu3 2 2 0 0
individu4 6 0 3 2
individu5 5 0 2 3
individu6 7 1 0 5
individu7 4 3 1 2
I would like to know the structure of consumption by profession and get something like this :
ProportionOfCoat ProportionOfShirt ProportionOfShoe...
profession1 0.3 0.5 0.1
profession2 0.1 0.2 0.4
profession3 0.2 0.6 0.1
profession4 0.1 0.1 0.2
I don't know if it is clear, but finally I want to be able to say :
10% of clothing products that doctors bought are Tshirts whereas 20% of what teachers bought are T-shirts.
And finally, I'd like to draw a stacked barplot where each stack is scaled to sum to 100%.
I suppose that we can use dplyr?
Thank you very much !!
temp <- aggregate( . ~ profession, data=zzz, FUN=sum)
cbind(temp[1],temp[-1]/rowSums(temp[-1]))
or also using prop.table
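The prop.table variant could look like the sketch below. The zzz data frame here is a small made-up sample with the question's column names; prop.table with margin = 1 divides each row by its row total.

```r
# Hypothetical sample in the shape described in the question
zzz <- data.frame(
  profession    = c(1, 1, 2, 2, 3),
  NumberOfCoat  = c(1, 2, 0, 3, 1),
  NumberOfShirt = c(1, 4, 3, 2, 0),
  NumberOfShoes = c(1, 1, 2, 5, 3)
)

# Total items bought per profession
totals <- aggregate(. ~ profession, data = zzz, FUN = sum)

# Row-wise proportions: each profession's row sums to 1
props <- prop.table(as.matrix(totals[-1]), margin = 1)
rownames(props) <- totals$profession
round(props, 3)
```

A profession whose totals are all zero would produce NaN here, so such rows may need to be dropped first.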
As other people noted, it is always better to post a reproducible example. I'll post one along with my solution, which is longer than the ones already posted but, for that same reason, maybe clearer.
First you should create an example dataframe:
set.seed(10) # set a seed because sample() is used below
n <- 1:100 # vector from 1 to 100 to obtain the number of products bought
p <- 1:8 # vector for obtaining id of professions
profession <- sample(p,50, replace = TRUE)
NumberOfCoat <- sample(n,50, replace = TRUE)
NumberOfShirt <- sample(n,50, replace = TRUE)
NumberOfShoes <- sample(n,50, replace = TRUE)
df <- as.data.frame(cbind(profession, NumberOfCoat,
NumberOfShirt, NumberOfShoes))
Once you have the data frame, you can explain what you have tried so far, or a possible solution. Here I used dplyr.
library(dplyr)
df <- df %>% group_by(profession) %>% summarize(coats = sum(NumberOfCoat),
shirts = sum(NumberOfShirt),
shoes = sum(NumberOfShoes)) %>%
mutate(tot_prod = coats + shirts + shoes,
ProportionOfCoat = coats/tot_prod,
ProportionOfShirt = shirts/tot_prod,
ProportionofShoes = shoes/tot_prod) %>%
select(profession, ProportionOfCoat, ProportionOfShirt,
ProportionofShoes)
df corresponds to the second data frame you show, with the proportion of each product bought by each profession. In my example it looks like this:
profession ProportionOfCoat ProportionOfShirt ProportionofShoes
<int> <dbl> <dbl> <dbl>
1 1 0.3910483 0.2343934 0.3745583
2 2 0.4069641 0.3525571 0.2404788
3 3 0.3330804 0.3968134 0.2701062
4 4 0.2740657 0.3952435 0.3306908
5 5 0.2573991 0.3784753 0.3641256
6 6 0.2293814 0.3543814 0.4162371
7 7 0.2245841 0.3955638 0.3798521
8 8 0.2861635 0.3490566 0.3647799
If you want to produce a stacked barplot, you have to reshape your data to a long format in order to use ggplot2. As @alistaire noted, you can do it with the gather function from the tidyr package (pivot_longer supersedes gather in current tidyr versions).
library(tidyr)
df <- df %>% gather(product, proportion, -profession)
And finally you can plot it with ggplot2.
library(ggplot2)
ggplot(df, aes(x=profession, y=proportion, fill=product)) +
geom_bar(stat="identity")
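For the requested bars scaled to 100%, a variant of the same idea is sketched below with pivot_longer and position = "fill", which rescales each stack to sum to 1 even when the inputs are raw counts. The props data frame is made up for illustration; only the column names follow the answer above.

```r
library(tidyr)
library(ggplot2)

# Hypothetical proportions in the same wide shape as the summary above
props <- data.frame(
  profession        = 1:3,
  ProportionOfCoat  = c(0.3, 0.1, 0.2),
  ProportionOfShirt = c(0.5, 0.5, 0.6),
  ProportionOfShoes = c(0.2, 0.4, 0.2)
)

# Wide to long: one row per profession/product pair
long <- pivot_longer(props, cols = -profession,
                     names_to = "product", values_to = "proportion")

# factor() makes profession a discrete axis; position = "fill"
# normalises every bar to 100%
p <- ggplot(long, aes(x = factor(profession), y = proportion, fill = product)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "profession", y = "share of items bought")
p
```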
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 7 years ago.
background:
- dataframe with 60.000 lines
- 5 columns: pt/bi/sx/ex/re
- pt = subject; bi = birth; sx = sex; ex = exam (14 types); re = result of exam
> head(fim)
pct nasc sex exam res
1 ACF 11/09/1951 F ldl 81
2 ACF 11/09/1951 F colt 172
3 ACF 11/09/1951 F tg 152
4 ACF 11/09/1951 F ferr 28,1
5 ACF 11/09/1951 F fe 41
6 ACF 11/09/1951 F plq 256000
...
So.. as you can see, each subject has at least 14 rows corresponding to 14 exams with their results.
My problem is that I want to subset all patients, together with their full set of exams, based on one exam result. For example: I would like all subjects, and all of their exams, for whom exam1 == 15 or "positive".
Despite having tried several ways, the only solution I think is possible is through casting to wide format, selecting and reshaping again. BUT when I use the cast function, all values are changed:
library(reshape)
df_wide <- cast(df, pt~ex)
Long to wide works fine, but the original values are lost to new ones. Can anyone help me with that or has another idea on how I can subset it in another way?
> head(dfw)
pct hcv ldl colt cr ferr fe...
1 AFC R 73 157 9,56 1687,0 80
2 AAPS R 78 130 0,91 879,0 104
3 ASS R 96 151 0,76 666,2 138
4 ARS R 67 115 0,73 674,0 133
5 ARDS R 180 261 0,71 105,0 110
...
Solution:
# fim is the long-format data frame shown above (columns pct, exam, res)
keep <- fim[fim$exam == "hcv" & fim$res == "R", "pct"]
fim <- fim[!duplicated(fim), ]
subset_fim <- filter(fim, pct %in% keep)
subset_fim %>% group_by(pct) %>% filter(!duplicated(exam))
You may want to consider the dplyr package, which offers good options for manipulating data. For this task, you can try something like this:
library(dplyr)
df <- filter(df, ex == 'ex1' & re == 15)
If you want to do with base package, you can do something like this:
df <- df[df$ex == 'ex1' & df$re == 15, ]
Edit:
If the goal is to keep all rows for a patient as long as any one row has ex1 & 15, you can achieve that as follows:
library(dplyr)
ptToKeep <- filter(df, ex == 'ex1' & re == 15)$pt
df <- filter(df, pt %in% ptToKeep)
Or, with base as shown in the comment above:
ptToKeep <- df[df$ex == 'ex1' & df$re == 15, ]$pt
df <- df[df$pt %in% ptToKeep, ]
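The same "keep every row of a patient if any row matches" logic can also be written as a grouped filter, without the intermediate vector. A sketch using the column names from the question's head(fim) output; the sample rows and the "NR" value are made up for illustration:

```r
library(dplyr)

# Hypothetical long-format sample: one row per patient and exam
fim <- data.frame(
  pct  = c("ACF", "ACF", "ACF", "AAPS", "AAPS"),
  exam = c("hcv", "ldl", "colt", "hcv", "ldl"),
  res  = c("R", "81", "172", "NR", "78"),
  stringsAsFactors = FALSE
)

# any() is evaluated once per patient, so either all of a patient's
# rows are kept or none are
subset_fim <- fim %>%
  group_by(pct) %>%
  filter(any(exam == "hcv" & res == "R")) %>%
  ungroup()
subset_fim
```

Here only patient ACF has an hcv result of "R", so all three ACF rows are returned and both AAPS rows are dropped.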
Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I have a vector in R that is a factor, a list of 256 NFL teams. I need to change every team name to its abbreviation, for example "Washington Redskins" to "WAS" or "New England Patriots" to "NE". What is the best technique for this type of problem? I'm sure this is something easy, so don't beat me up on this one.
You could read the acronyms from a web page and match the team names against yours.
Here's one example.
library(XML)
tab <- readHTMLTable("http://sportsdelve.wordpress.com/abbreviations/")[[1]]
head(tab)
# V1 V2
# 1 ARZ Arizona Cardinals
# 2 ATL Atlanta Falcons
# 3 BAL Baltimore Ravens
# 4 BALC Baltimore Colts
# 5 BCLT Baltimore Colts (1950)
# 6 BALCLT Baltimore Colts (AAFC)
And you can use regular expression matching to find your teams...
tab[grepl("WAS|NE", tab[[1]]), ]
# V1 V2
# 38 NE New England Patriots
# 58 WAS Washington Redskins
One way is to have a dictionary, i.e. a file with each full name and each short name. You can then match this file to your full names, using the full names as the ID for the match.
Example:
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash")) ## needs to be a data frame in order for plyr::join to work
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd")) ## the dictionary; one row per unique name
matched <- plyr::join(x = full.names, y = dic, by = "full") ## using join from the plyr package
Output:
full short
1 wash ww
2 wash ww
3 denv dd
4 denv dd
5 wash ww
The merge command also works (using Chaconne's data here):
full.names <- data.frame(full = c("wash", "wash", "denv", "denv", "wash"))
dic <- data.frame(full = c("wash", "denv"), short = c("ww", "dd"))
merge(full.names,dic)
full short
1 denv dd
2 denv dd
3 wash ww
4 wash ww
5 wash ww
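Note that merge returns the rows sorted by the key, so the original order (wash, wash, denv, denv, wash) is lost. If the order matters, a named character vector used as a lookup keeps it. A small sketch, with a two-team dictionary standing in for the full list:

```r
# Hypothetical input: the factor of full team names
team <- factor(c("Washington Redskins", "New England Patriots",
                 "Washington Redskins"))

# Dictionary as a named vector: names are full names, values are abbreviations
lookup <- c("Washington Redskins"  = "WAS",
            "New England Patriots" = "NE")

# Index by name; as.character() is needed so the factor's labels,
# not its integer codes, are used for the lookup
abbr <- unname(lookup[as.character(team)])
abbr
```

Any team name missing from the dictionary comes back as NA, which makes gaps easy to spot.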
You can just change the levels directly
levels(team)
will list the order of the levels assigned to your factor
levels(team) <- c("ARZ","ATL", ...)
will change the labels.
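Typing the new labels by hand depends on getting the level order exactly right. A sketch that combines this with a dictionary, so each level is replaced by name (the two-team lookup vector is illustrative):

```r
team <- factor(c("Washington Redskins", "New England Patriots",
                 "Washington Redskins"))

lookup <- c("Washington Redskins"  = "WAS",
            "New England Patriots" = "NE")

# levels(team) is in the factor's own order; indexing the lookup with it
# returns the abbreviations in that same order
levels(team) <- unname(lookup[levels(team)])
team
```

This relabels the factor in place and touches only the levels, not each of the 256 elements.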