R: Cast function returning wrong values [closed] - r

Closed 7 years ago.
Background:
- data frame with 60,000 rows
- 5 columns: pt/bi/sx/ex/re
- pt = subject; bi = birth; sx = sex; ex = exam (14 types); re = result of exam
> head(fim)
pct nasc sex exam res
1 ACF 11/09/1951 F ldl 81
2 ACF 11/09/1951 F colt 172
3 ACF 11/09/1951 F tg 152
4 ACF 11/09/1951 F ferr 28,1
5 ACF 11/09/1951 F fe 41
6 ACF 11/09/1951 F plq 256000
...
So, as you can see, each subject has at least 14 rows, one per exam, with the corresponding results.
My problem is that I want to subset patients, keeping their full set of exams, based on an exam result. For example: I would like to keep all subjects, with all of their exams, for whom exam1 == 15 or "positive".
Despite having tried several approaches, the only solution I can think of is to cast to wide format, select, and reshape back to long. BUT when I use the cast function, all values are changed:
library(reshape)
df_wide <- cast(df, pt~ex)
The reshape from long to wide runs, but the original values are replaced by different ones. Can anyone help me with that, or suggest another way to subset the data?
> head(dfw)
pct hcv ldl colt cr ferr fe...
1 AFC R 73 157 9,56 1687,0 80
2 AAPS R 78 130 0,91 879,0 104
3 ASS R 96 151 0,76 666,2 138
4 ARS R 67 115 0,73 674,0 133
5 ARDS R 180 261 0,71 105,0 110
...
Solution:
library(dplyr)
# fim is the long-format data frame (columns pct, nasc, sex, exam, res)
keep <- fim[fim$exam == "hcv" & fim$res == "R", "pct"]
fim <- fim[!duplicated(fim), ]
subset_fim <- filter(fim, pct %in% keep)
subset_fim %>% group_by(pct) %>% filter(!duplicated(exam))

You may want to consider the dplyr library, which offers convenient functions for manipulating data. For this task, you can try something like this:
library(dplyr)
df <- filter(df, ex == 'ex1' & re == 15)
If you want to do it with base R, you can do something like this:
df <- df[df$ex == 'ex1' & df$re == 15, ]
Edit:
If the goal is to keep all rows for a patient as long as any one row has ex1 & 15, you can achieve that as follows:
library(dplyr)
ptToKeep <- filter(df, ex == 'ex1' & re == 15)$pt
df <- filter(df, pt %in% ptToKeep)
Or, with base as shown in the comment above:
ptToKeep <- df[df$ex == 'ex1' & df$re == 15, ]$pt
df <- df[df$pt %in% ptToKeep, ]
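As a quick check of the "keep every row for a matching patient" logic, here is a minimal base R sketch on made-up data. The column names pt, ex and re follow the question; the values and the helper names pt_to_keep and kept are invented for illustration.

```r
# Toy long-format data: one row per patient and exam (values are made up)
df <- data.frame(
  pt = c("A", "A", "B", "B", "C", "C"),
  ex = c("ex1", "ex2", "ex1", "ex2", "ex1", "ex2"),
  re = c(15, 3, 7, 9, 15, 1),
  stringsAsFactors = FALSE
)

# Patients with at least one row where ex1 equals 15
pt_to_keep <- unique(df$pt[df$ex == "ex1" & df$re == 15])

# Keep all rows (every exam) for those patients
kept <- df[df$pt %in% pt_to_keep, ]
print(kept)
```

Patient B is dropped entirely, while A and C keep both of their exam rows, so no reshaping to wide format is needed.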


Converting XML to dataframe [closed]

Closed last year.
I want to convert an XML file to a data frame.
I'm aware of XML::xmlToDataFrame, but it gives an error in my case.
The XML can be found here:
https://api.data.gov.hk/v1/historical-archive/get-file?url=https%3A%2F%2Fresource.data.one.gov.hk%2Ftd%2Ftraffic-detectors%2FrawSpeedVol-all.xml&time=20211216-0513
Thanks for all answers!
Since your XML file contains multiple nested children, XML::xmlToDataFrame raises an error.
I've approached the problem with a naive method, but it works!
Here's what I've done:
The following code creates a data frame from the tags inside <lane>.
library(xml2)
require(XML)
pg <- read_xml("https://s3-ap-southeast-1.amazonaws.com/historical-resource-archive/2021/12/16/https%253A%252F%252Fresource.data.one.gov.hk%252Ftd%252Ftraffic-detectors%252FrawSpeedVol-all.xml/0513")
records <- xml_find_all(pg, "//lane")
nodenames<-xml_name(xml_children(records))
nodevalues<-trimws(xml_text(xml_children(records)))
lane_id <- nodevalues[seq(1, length(nodevalues), 6)]
speed <- nodevalues[seq(2, length(nodevalues), 6)]
occupancy <- nodevalues[seq(3, length(nodevalues), 6)]
volume <- nodevalues[seq(4, length(nodevalues), 6)]
s.d. <- nodevalues[seq(5, length(nodevalues), 6)]
valid <- nodevalues[seq(6, length(nodevalues), 6)]
df <- data.frame(lane_id, speed, occupancy, volume, s.d., valid)
head(df)
The df looks like this:
lane_id speed occupancy volume s.d. valid
1 Fast Lane 70 0 0 0 Y
2 Middle Lane 76 6 3 11.1 Y
3 Slow Lane 70 6 0 0 Y
4 Fast Lane 82 1 1 0 Y
5 Middle Lane 63 3 1 0 Y
6 Slow Lane 79 2 1 0 Y
If you want to extract the data of each <detector>, you can use the following code:
################ Extract Detector Data #########
records2 <- xml_find_all(pg, "//detector")
vals2 <- trimws(xml_text(records2))
nodenames2 <-xml_name(xml_children(records2))
nodevalues2 <-trimws(xml_text(xml_children(records2)))
detector_id <- nodevalues2[seq(1, length(nodevalues2), 3)]
direction <- nodevalues2[seq(2, length(nodevalues2), 3)]
lanes <- nodevalues2[seq(3, length(nodevalues2), 3)]
df2 <- data.frame(detector_id, direction, lanes)
head(df2)
The df2 looks like this:
detector_id direction lanes
1 AID01101 South East Fast Lane70000YMiddle Lane766311.1YSlow Lane70600Y
2 AID01102 North East Fast Lane82110YMiddle Lane63310YSlow Lane79210Y
3 AID01103 South East Fast Lane50000YMiddle Lane65210YSlow Lane192310Y
4 AID01104 North East Fast Lane50000YSlow Lane63110Y
5 AID01105 North East Fast Lane50100YSlow Lane53410Y
6 AID01106 South East Fast Lane50300YSlow Lane56510Y
But, as you may notice, the lanes column isn't as clean as you would like, because it holds grandchild tags whose text gets concatenated.
You could, however, build a new data frame from df and df2 as needed.
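One way to combine the two levels without relying on a fixed number of children per node is to start from each lane and look up its enclosing detector with XPath. The sketch below uses xml2 on a small hand-written document. The tag names (detector_id, direction, lanes, lane, lane_id, speed) are assumptions based on the column names above and may not match the real feed.

```r
library(xml2)

# Hypothetical miniature of the feed; the real tag names may differ
xml_txt <- paste0(
  "<root>",
  "<detector><detector_id>AID01101</detector_id><direction>South East</direction>",
  "<lanes>",
  "<lane><lane_id>Fast Lane</lane_id><speed>70</speed></lane>",
  "<lane><lane_id>Slow Lane</lane_id><speed>60</speed></lane>",
  "</lanes></detector>",
  "<detector><detector_id>AID01104</detector_id><direction>North East</direction>",
  "<lanes>",
  "<lane><lane_id>Fast Lane</lane_id><speed>50</speed></lane>",
  "</lanes></detector>",
  "</root>"
)
doc <- read_xml(xml_txt)

# One row per lane; detector fields are looked up from the lane's ancestor
lanes <- xml_find_all(doc, "//lane")
lane_df <- data.frame(
  detector_id = xml_text(xml_find_first(lanes, "ancestor::detector/detector_id")),
  direction   = xml_text(xml_find_first(lanes, "ancestor::detector/direction")),
  lane_id     = xml_text(xml_find_first(lanes, "lane_id")),
  speed       = as.numeric(xml_text(xml_find_first(lanes, "speed"))),
  stringsAsFactors = FALSE
)
print(lane_df)
```

Because each field is fetched by name relative to its lane, detectors with two or three lanes are handled the same way, and a missing child gives NA instead of shifting every later value.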

R: is it possible to convert a knitr::kable to dataframe?

I am using the pivot function from the lessR package, to create an Excel-like pivot table with two categorical variables that make up the vertical and horizontal categories, and a mean in each cell. (Hope this makes sense).
I followed the code that the documentation (https://cran.r-project.org/web/packages/lessR/vignettes/pivot.html) gives. Let's follow their example:
d <- Read("Employee")
a <- pivot(d, mean, Salary, Dept, Gender)
The data d is like this:
Years Gender Dept Salary JobSat Plan Pre Post
Ritchie, Darnell 7 M ADMN 53788.26 med 1 82 92
Wu, James NA M SALE 94494.58 low 1 62 74
Hoang, Binh 15 M SALE 111074.86 low 3 96 97
Jones, Alissa 5 F <NA> 53772.58 <NA> 1 65 62
Downs, Deborah 7 F FINC 57139.90 high 2 90 86
Afshari, Anbar 6 F ADMN 69441.93 high 2 100 100
Knox, Michael 18 M MKTG 99062.66 med 3 81 84
Campagna, Justin 8 M SALE 72321.36 low 1 76 84
Kimball, Claire 8 F MKTG 61356.69 high 2 93 92
The pivot table a is a nice table, exactly as I want it to look in terms of cell contents, etc. It appears to be a knitr_kable.
Gender F M
Dept
------- --------- ---------
ACCT 63237.16 59626.20
ADMN 81434.00 80963.35
FINC 57139.90 72967.60
MKTG 64496.02 99062.66
SALE 64188.25 86150.97
Next, I would like to make a dataframe out of this, for easier manipulation in my code and for copying it to the clipboard. However, I don't know how to convert a knitr_kable to a dataframe. Here is my code and the error it results in:
as.data.frame(a)
Error in as.data.frame.default(a) :
cannot coerce class ‘"knitr_kable"’ to a data.frame
The knitr-documentation does not say anything about this conversion - it is only about converting a dataframe to a knitr_kable, which is the opposite of what I want.
I have also tried pivottabler, but this has similar issues: the resulting class cannot be coerced to a dataframe either.
Here are two potential answers:
Most direct: Wrangle the data yourself
If you're open to a tidyverse-style approach, it only takes a few lines to do the wrangling and summarising yourself. That will give you a data frame (a tibble) that you can work with right away.
# load packages
library(lessR)
library(dplyr)
library(tidyr)
# load data
d <- Read("Employee")
# use tidyverse-style code to pivot and summarise the data yourself
d %>%
group_by(Gender, Dept) %>%
summarise(Salary_mean = mean(Salary)) %>%
pivot_wider(names_from= "Gender", values_from = "Salary_mean")
Read the knitr::kable() markdown output into a data frame
If you prefer to work backwards from a knitr::kable() output to a dataframe, this is addressed in this SO question: Markdown table to data frame in R
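A minimal base R sketch of that second route follows. It assumes the table text is available as a character vector of pipe-delimited markdown lines (as far as I know, a knitr_kable object is such a character vector, but the exact layout depends on the format used). The sample lines in md are hypothetical and typed by hand.

```r
# Hypothetical pipe-format markdown table, one string per line
md <- c(
  "|Dept |        F|        M|",
  "|:----|--------:|--------:|",
  "|ACCT | 63237.16| 59626.20|",
  "|ADMN | 81434.00| 80963.35|"
)

# Drop the separator row (only dashes, colons, pipes and spaces)
body <- md[!grepl("^\\|[-: |]+\\|$", md)]

# Strip the leading and trailing pipe of each line
body <- gsub("^\\||\\|$", "", body)

# Parse the remaining lines as pipe-separated text
tab <- read.table(text = body, sep = "|", header = TRUE,
                  strip.white = TRUE, stringsAsFactors = FALSE)
print(tab)
```

If the printed table uses a different layout (for example space-aligned columns with dashed rules), the separator pattern and the sep argument would need to be adapted.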

efficient way to match and sum variables of two data frames based on two criteria [duplicate]

This question already has answers here:
How to sum a variable by group
(18 answers)
Closed 4 years ago.
I have a data frame df1 on import data for 397 different industries over 17 years and several different exporting countries/ regions.
> head(df1)
year importer exporter imports sic87dd
2300 1991 USA CAN 9.404848e+05 2011
2301 1991 USA CAN 2.259720e+04 2015
2302 1991 USA CAN 5.459608e+02 2021
2303 1991 USA CAN 1.173237e+04 2022
2304 1991 USA CAN 2.483033e+04 2023
2305 1991 USA CAN 5.353975e+00 2024
However, I want the sum of all imports for a given industry and a given year, regardless of where they came from. (The importer is always the US, sic87dd is a code that uniquely identifies the 397 industries)
So far I have tried the following code, which works correctly but is terribly inefficient and takes ages to run.
sic87dd <- unique(df1$sic87dd)
year <- unique (df1$year)
df2 <- data.frame("sic87dd" = rep(sic87dd, each = 17), "year" = rep(year, 397), imports = rep(0, 6749))
i <- 1
j <- 1
while(i <= nrow(df2)){
while(j <= nrow(df1)){
if((df1$sic87dd[j] == df2$sic87dd[i]) == TRUE & (df1$year[j] == df2$year[i]) == TRUE){
df2$imports[i] <- df2$imports[i] + df1$imports[j]
}
j <- j + 1
}
i <- i + 1
j <- 1
}
Is there a more efficient way to do this? I have seen some questions here that were somewhat similar and suggested the use of the data.table package, but I can't figure out how to make it work in my case.
Any help is appreciated.
There is a simple solution using dplyr.
Converting the industry field to a factor is optional, since group_by also works on numeric codes, but it makes clear that the code is a label rather than a quantity (I'm assuming this entire field consists of a 4-digit number):
df1$sic87dd <- as.factor(df1$sic87dd)
Next, group by industry and year, then summarise:
library(dplyr)
df1 %>%
group_by(sic87dd, year) %>%
summarise(total_imports = sum(imports))
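For comparison, the per-industry, per-year total can also be sketched in base R with aggregate. The rows below are made up and only mimic the shape of df1; the result name totals is invented for illustration.

```r
# Made-up rows shaped like df1 in the question
df1 <- data.frame(
  year     = c(1991, 1991, 1991, 1992),
  exporter = c("CAN", "MEX", "CAN", "CAN"),
  imports  = c(10, 5, 2, 7),
  sic87dd  = c(2011, 2011, 2015, 2011)
)

# Sum imports over all exporters for each industry/year pair
totals <- aggregate(imports ~ sic87dd + year, data = df1, FUN = sum)
print(totals)
```

This replaces the nested while loops with a single grouped sum, so the work no longer grows with the product of the two table sizes.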

R: Using different DFs to get third DF with specific info from first 2

I have two data frames, df1 has information about a publication's year, outlet name, total articles in this publication in a year, and a cumulative sum of articles over the period of time I'm studying. df2 has a random sample of article IDs, with potential values ranging from 1 to the total number of articles given by df1$cumsum.
What I need to do is to grab each article ID in df2 and identify in which publication and year it falls under, using the information contained in df1.
Here's a minimally reproducible example:
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2009, 2000:2009)
df1$outlet <- c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,2,2,2,2,2,2,2,2,2)
df1$article_total <- sample(1:200, 20, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_num <- sample(1:2102, 100, replace = T) # get random sample of article IDs for the total number of articles I have in this db
df2 <- as.data.frame(df2)
Ideally, I would also like to calculate an article's ID in each year. For example, in the data above, outlet 1 has 14 articles in the year 2000 and 168 in 2001 (cumsum = 183). If I have an article ID of 156, I would like to know that it is the 142nd article in the year 2001 of publication 1. And so on and so forth for every article ID I have in this database.
I was thinking I should do this with a for loop, but I'm 100% lost in writing it. Here's what I began writing, but I have a feeling I'm not on the right track with it:
for (i in seq_along(df2$art_num)) {
article_number <- df2$art_num[i]
if (article_number %in% df1$cumsum){ # note: cumsum should be an interval before doing this?
# get article number, year, publication in new df
# also calculate article ID in each year/publication
}
}
Thanks in advance for any help! I'm still lost with writing loops in R...
#######################
EDITED EXAMPLE as per Frank's suggestion
set.seed(890)
df1 <- NULL
df1$year <- c(2000:2002, 2000:2002)
df1$outlet <- c(1, 1, 1, 2,2,2)
df1$article_total <- sample(1:50, 6, replace = T)
df1$cumsum <- cumsum(df1$article_total)
df1 <- as.data.frame(df1)
df2 <- NULL
df2$art_id <- c(66, 120, 77, 156, 24)
df2 <- as.data.frame(df2)
Here's the output I'm looking for:
art_id outlet year article_number
1 66 1 2002 19
2 120 2 2000 35
3 77 1 2002 30
4 156 2 2001 35
5 24 1 2000 20
This example shows my ideal output in df3, which I built by hand. It has one column with the article's ID, the appropriate outlet, the year, and a new variable art_number. This differs from the article ID in that I calculated it from df1$cumsum and df3$art_id. In this example, the first row shows that the first article in my database has an ID of 66. I obtain an art_number value of 19 because this article (id = 66) is the 19th article published in the year 2002 by outlet 1. I calculated this value by looking at the article ID, locating the year and outlet based on df1$cumsum, and then subtracting the df1$cumsum value for the previous year from the art_id value. So for this specific article, I calculated df3$art_number = df3$art_id[1,1] - df1$cumsum[2,4]
I need to do this calculation for every article in my database so I don't have to do this process by hand forever.
I think your data structure makes sense, though it would be easier with one additional column, for the first article in a year and outlet:
library(data.table)
setDT(df1); setDT(df2)
df1[, art_cstart := shift(cumsum(article_total), fill=0L) + 1L]
year outlet article_total cumsum art_cstart
1: 2000 1 4 4 1
2: 2001 1 43 47 5
3: 2002 1 38 85 48
4: 2000 2 36 121 86
5: 2001 2 39 160 122
6: 2002 2 8 168 161
Now, we can do a rolling update join, "rolling" each art_id to the first cumsum at or above it (roll=-Inf) and computing each desired column:
df2[, c("outlet", "year", "art_num") := df1[df2, on=.(cumsum = art_id), roll=-Inf, .(
x.outlet,
x.year,
i.art_id - x.art_cstart + 1L
)]]
art_id outlet year art_num
1: 66 1 2002 19
2: 120 2 2000 35
3: 77 1 2002 30
4: 156 2 2001 35
5: 24 1 2001 20
How it works
x[i, on=, roll=, j] is the syntax for a join, looking up each row of i in x.
In this join j evaluates to a list of columns, .(...) shorthand for list(...).
Column assignment is done with (colnames) := .(...).
The assignment is to the existing table df2 instead of unnecessarily creating a new table.
For details on how data.table syntax works, see the startup messages...
> library(data.table)
data.table 1.10.4
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
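The same interval lookup can be sketched in base R with findInterval, as an alternative to the rolling join. The article totals below are copied from the df1 printout in this answer; idx, prev_cumsum and df3 are names introduced here for illustration.

```r
# df1 as printed above (article_total values taken from that output)
df1 <- data.frame(
  year          = c(2000, 2001, 2002, 2000, 2001, 2002),
  outlet        = c(1, 1, 1, 2, 2, 2),
  article_total = c(4, 43, 38, 36, 39, 8)
)
df1$cumsum <- cumsum(df1$article_total)
art_id <- c(66, 120, 77, 156, 24)

# Row of df1 whose interval (previous cumsum, cumsum] contains each ID
idx <- findInterval(art_id, df1$cumsum, left.open = TRUE) + 1

# Cumulative total just before each row starts
prev_cumsum <- c(0, head(df1$cumsum, -1))

df3 <- data.frame(
  art_id  = art_id,
  outlet  = df1$outlet[idx],
  year    = df1$year[idx],
  art_num = art_id - prev_cumsum[idx]
)
print(df3)
```

With left.open = TRUE an ID equal to a cumulative total is assigned to the row that ends there, which matches the convention that IDs run from 1 up to the total.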
This is the code you need, I think:
df3 <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(df3) <- c("articleNumber", "year", "publication")
# lower bound of each interval: the cumulative total before that row
lower <- c(0, head(df1$cumsum, -1))
for (i in seq_along(df2$art_num)) {
for (j in seq_along(df1$cumsum)) {
# article i belongs to row j if its number lies in (lower[j], cumsum[j]]
if ((df2$art_num[i] > lower[j]) && (df2$art_num[i] <= df1$cumsum[j])) {
# get article number, year, publication in new df
df3[i, 1] <- df2$art_num[i]
df3[i, 2] <- df1$year[j]
df3[i, 3] <- df1$outlet[j]
# also calculate article ID in each year/publication ISN'T THIS
# art_num?
}
}
}

summarize data from csv using R

I'm new to R, and I wrote some code to summarize data from a .csv file according to my needs.
Here is the code.
raw <- read.csv("trees.csv")
The data looks like this:
SNAME CNAME FAMILY PLOT INDIVIDUAL CAP H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae 5 176 15 9.5
2 Andira fraxinifolia Benth. Angelim Fabaceae 3 321 12 6.0
3 Andira fraxinifolia Benth. Angelim Fabaceae 3 326 14 7.0
4 Andira fraxinifolia Benth. Angelim Fabaceae 3 327 18 5.0
5 Andira fraxinifolia Benth. Angelim Fabaceae 3 328 12 6.0
6 Andira fraxinifolia Benth. Angelim Fabaceae 3 329 21 7.0
#add 2 other columns
for (i in 1:nrow(raw)) {
raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])
raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}
#here comes.
I need a new data frame, with the mean of columns H and CAP and the sums of columns VOLUME and BASALAREA. This dataframe is grouped by column SNAME and subgrouped by column PLOT.
plotSummary = merge(
aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))
plotSummary = merge(
plotSummary,
aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))
plotSummary = merge(
plotSummary,
aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))
The functions treeVolume and treeBasalArea just return numbers.
treeVolume <- function(radius, height) {
return (0.000074230*radius**1.707348*height**1.16873)
}
treeBasalArea <- function(radius) {
return (((radius**2)*pi)/40000)
}
I'm sure that there is a better way of doing this, but how?
I can't manage to read your example data in, but I think I've made something that generally represents it...so give this a whirl. This answer builds off of Greg's suggestion to look at plyr and the functions ddply to group by segments of your data.frame and numcolwise to calculate your statistics of interest.
#Sample data
set.seed(1)
dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2),
CAP = rnorm(6),
H = rlnorm(6),
VOLUME = runif(6),
BASALAREA = rlnorm(6)
)
#Calculate mean for all numeric columns, grouping by sname and plot
library(plyr)
ddply(dat, c("sname", "plot"), numcolwise(mean))
#-----
sname plot CAP H VOLUME BASALAREA
1 a a 0.4844135 1.182481 0.3248043 1.614668
2 b b 0.2565755 3.313614 0.6279025 1.397490
3 c c -0.8280485 1.627634 0.1768697 2.538273
EDIT - response to updated question
Ok - now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is vectorized, meaning that you can calculate ALL of the values for VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:
dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:
ddply(dat, c("sname", "plot"), summarize,
meanCAP = mean(CAP),
meanH = mean(H),
sumVOLUME = sum(VOLUME),
sumBASAL = sum(BASALAREA)
)
Which will give you an output that looks like:
sname plot meanCAP meanH sumVOLUME sumBASAL
1 a a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
2 b b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
3 c c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
The help pages for ?ddply, ?transform, ?summarize should be insightful.
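The same vectorise-then-summarise idea can also be sketched without plyr, using base R aggregate and merge. The two functions are copied from the question; the four data rows are made up, and means, sums and plotSummary are illustrative names.

```r
# Functions from the question (written with ^ instead of **)
treeVolume <- function(radius, height) {
  0.000074230 * radius^1.707348 * height^1.16873
}
treeBasalArea <- function(radius) {
  (radius^2 * pi) / 40000
}

# Made-up rows using the question's column names
raw <- data.frame(
  SNAME = c("a", "a", "b", "b"),
  PLOT  = c(3, 3, 5, 5),
  CAP   = c(12, 14, 15, 21),
  H     = c(6, 7, 9.5, 7)
)

# Vectorised: both columns are computed in one pass, no row loop
raw$VOLUME    <- treeVolume(raw$CAP, raw$H)
raw$BASALAREA <- treeBasalArea(raw$CAP)

# Means for CAP and H, sums for VOLUME and BASALAREA, per SNAME and PLOT
means <- aggregate(cbind(CAP, H) ~ SNAME + PLOT, data = raw, FUN = mean)
sums  <- aggregate(cbind(VOLUME, BASALAREA) ~ SNAME + PLOT, data = raw, FUN = sum)
plotSummary <- merge(means, sums, by = c("SNAME", "PLOT"))
print(plotSummary)
```

Using bare column names with a data argument keeps the result columns named CAP, H, VOLUME and BASALAREA, which lets merge join on SNAME and PLOT explicitly.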
Look at the plyr package. It will split the data by the SNAME variable for you; you then give it code to compute the set of summaries that you want (mixing mean and sum and whatever), and it will put the pieces back together for you. You probably want either the 'ddply' or the 'daply' function in that package.
