I have a data frame (tab3) looking like this:
CWRES ID AGE BMI WGT
3 0.59034000 1 37.5 20.7 64.6
4 1.81300000 1 37.5 20.7 64.6
5 1.42920000 1 37.5 20.7 64.6
6 0.59194000 1 37.5 20.7 64.6
7 0.30886000 1 37.5 20.7 64.6
8 -0.14601000 1 37.5 20.7 64.6
9 -0.19776000 1 37.5 20.7 64.6
10 0.74208000 1 37.5 20.7 64.6
11 -0.69280000 1 37.5 20.7 64.6
38 -2.42900000 1 37.5 20.7 64.6
39 -0.25732000 1 37.5 20.7 64.6
40 -0.49689000 1 37.5 20.7 64.6
41 -0.11556000 1 37.5 20.7 64.6
42 0.91036000 1 37.5 20.7 64.6
43 -0.24766000 1 37.5 20.7 64.6
44 -0.14962000 1 37.5 20.7 64.6
45 -0.45651000 1 37.5 20.7 64.6
48 0.53237000 2 58.5 23.0 53.4
49 -0.53284000 2 58.5 23.0 53.4
50 -0.33086000 2 58.5 23.0 53.4
51 -0.56355000 2 58.5 23.0 53.4
52 0.00883120 2 58.5 23.0 53.4
53 -1.00650000 2 58.5 23.0 53.4
80 0.85810000 2 58.5 23.0 53.4
81 -0.71715000 2 58.5 23.0 53.4
82 0.44346000 2 58.5 23.0 53.4
83 1.09890000 2 58.5 23.0 53.4
84 0.98726000 2 58.5 23.0 53.4
85 0.19667000 2 58.5 23.0 53.4
86 -1.32570000 2 58.5 23.0 53.4
89 -4.56920000 3 43.5 26.7 66.2
90 0.75174000 3 43.5 26.7 66.2
91 0.40935000 3 43.5 26.7 66.2
92 0.18340000 3 43.5 26.7 66.2
93 0.27399000 3 43.5 26.7 66.2
94 -0.23596000 3 43.5 26.7 66.2
95 -1.59460000 3 43.5 26.7 66.2
96 -0.03708900 3 43.5 26.7 66.2
97 0.68750000 3 43.5 26.7 66.2
98 -0.47979000 3 43.5 26.7 66.2
125 2.23200000 3 43.5 26.7 66.2
126 0.90470000 3 43.5 26.7 66.2
127 -0.34493000 3 43.5 26.7 66.2
128 -0.02114400 3 43.5 26.7 66.2
129 -1.08830000 3 43.5 26.7 66.2
130 -0.33937000 3 43.5 26.7 66.2
131 1.19820000 3 43.5 26.7 66.2
132 0.81653000 3 43.5 26.7 66.2
133 1.61810000 3 43.5 26.7 66.2
134 0.42914000 3 43.5 26.7 66.2
135 -1.03150000 3 43.5 26.7 66.2
...
I want to plot the variable CWRES versus ID, AGE, BMI and WGT. To do this I use this code:
library(ggplot2)
plotloop <- function(x, na.rm = TRUE, ...) {
nm <- names(x)
for (i in seq_along(nm)) {
print(ggplot(x,aes_string(x = nm[i], y = nm[1])) +
geom_point()) }
}
plotloop(tab3)
However, it also plots CWRES vs CWRES and I do not want to plot CWRES vs CWRES.
What should I do?
Thanks in advance,
Mario
The loop may not be the best way to go about plotting multiple plots with ggplot. Hence I will not try to fix your code but suggest an alternative route.
You should first melt your data.frame to transform it to long format with only retaining the CWRES variable:
require(reshape2)
mDf <- melt(x, c("CWRES"))
Now you can create your plots as follows:
g <- ggplot(mDF,aes(x=CWRES,y=value))
g <- g + geom_point()
g <- g + facet_grid(.~variable)
g
This creates a faceted plot with the four scatter plots next to each other.
If you really want to plot multiple i would proceed as follows (based on the formatting above):
variables <- unique(mDF$variable)
for (v in variables)
{
print(ggplot(mDF[mDF$variable==v,],aes(x=CWRES,y=value)) + geom_point() )
}
Finally I found the solution:
This code works great!
for(i in names(tab3)[2:5]) {
df2 <- tab3[, c(i, "CWRES")]
print(ggplot(tab3) + geom_point(aes_string(x = i, y = "CWRES")) + theme_bw())
}
Related
I am trying to reshape my data to long instead of wide format using the same code provided earlier link , however it doesn't work even after several trials to modify names_pattern = "(.*)_(pre|post.*)",
My data sample is
data1<-read.table(text="
Serial_ID pre_EDV pre_ESV pre_LVEF post_EDV post_ESV post_LVEF
1 76.2 32.9 56.8 86.3 36.6 57.6
2 65.4 35.9 45.1 60.1 26.1 56.7
3 64.4 35.1 45.5 72.5 41.1 43.3
4 50 13.9 72.1 46.4 18.4 60.4
5 89.6 32 64.3 70.9 19.3 72.8
6 62 20.6 66.7 55.9 17.8 68.2
7 91.2 37.7 58.6 61.9 23.8 61.6
8 62 24 61.3 69.3 34.9 49.6
9 104.1 22.7 78.8 38.6 11.5 70.1
10 90.6 31.2 65.6 48 16.1 66.4", sep="", header=T)
I want to reshape my data to
put identical column headings below each other eg post_EDV below
pre_EDV
Create new column Pre vs. post
Fix column heading (remove "pre_" and "post_" to be "EDV" only (as shown in the screenshot below)).
This is the used code:
library(dplyr);library(tidyr);library(stringr)
out <- data %>% pivot_longer(cols = -Serial_ID,
names_to = c(".value", "prevspost"),
names_pattern = "(.*)_(pre|post.*)",
names_sep="_") #%>% as.data.frame
Also I tried names_prefix = c("pre_","post_") instead of names_pattern = "(.*)_(pre|post.*)", but it doesn't work.
Any advice will be greatly appreciated.
Edit I recommend using #Dave2e's superior approach.
The reason your attempt didn't work is because the pattern has to match in order. You could try this:
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("prevspost",".value"),
names_pattern = "(pre|post)_(\\w+)") %>%
dplyr::arrange(desc(prevspost),Serial_ID)
# A tibble: 20 x 5
Serial_ID prevspost EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 2 pre 65.4 35.9 45.1
3 3 pre 64.4 35.1 45.5
4 4 pre 50 13.9 72.1
5 5 pre 89.6 32 64.3
6 6 pre 62 20.6 66.7
7 7 pre 91.2 37.7 58.6
8 8 pre 62 24 61.3
9 9 pre 104. 22.7 78.8
10 10 pre 90.6 31.2 65.6
11 1 post 86.3 36.6 57.6
12 2 post 60.1 26.1 56.7
13 3 post 72.5 41.1 43.3
14 4 post 46.4 18.4 60.4
15 5 post 70.9 19.3 72.8
16 6 post 55.9 17.8 68.2
17 7 post 61.9 23.8 61.6
18 8 post 69.3 34.9 49.6
19 9 post 38.6 11.5 70.1
20 10 post 48 16.1 66.4
Your initial approach very close, it needed some simplification. Use only "names_sep" or "names_pattern"
library(tidyr)
library(dplyr)
data1 %>% pivot_longer(cols = -Serial_ID,
names_to = c("Pre vs. post", '.value'),
names_sep="_")
# A tibble: 20 x 5
Serial_ID `Pre vs. post` EDV ESV LVEF
<int> <chr> <dbl> <dbl> <dbl>
1 1 pre 76.2 32.9 56.8
2 1 post 86.3 36.6 57.6
3 2 pre 65.4 35.9 45.1
4 2 post 60.1 26.1 56.7
5 3 pre 64.4 35.1 45.5
6 3 post 72.5 41.1 43.3
7 4 pre 50 13.9 72.1
8 4 post 46.4 18.4 60.4
9 5 pre 89.6 32 64.3
10 5 post 70.9 19.3 72.8
11 6 pre 62 20.6 66.7
12 6 post 55.9 17.8 68.2
13 7 pre 91.2 37.7 58.6
14 7 post 61.9 23.8 61.6
15 8 pre 62 24 61.3
16 8 post 69.3 34.9 49.6
17 9 pre 104. 22.7 78.8
18 9 post 38.6 11.5 70.1
19 10 pre 90.6 31.2 65.6
20 10 post 48 16.1 66.4
try this:
library(dplyr);library(tidyr);library(stringr)
out <- data1 %>% pivot_longer(-Serial_ID,
names_to = c("measurement", "names"),
values_to = "values",
names_sep = "_")
out
# # A tibble: 60 x 4
# Serial_ID measurement names values
# <int> <chr> <chr> <dbl>
# 1 1 pre EDV 76.2
# 2 1 pre ESV 32.9
# 3 1 pre LVEF 56.8
# 4 1 post EDV 86.3
# 5 1 post ESV 36.6
# 6 1 post LVEF 57.6
# 7 2 pre EDV 65.4
# 8 2 pre ESV 35.9
# 9 2 pre LVEF 45.1
# 10 2 post EDV 60.1
# # ... with 50 more rows
Your code snipped passed the object "data" instead of "data1" into the pipe which produced an error:
"Error: No tidyselect variables were registered".
I have a matrix (X), in which I am trying to calculate the mean of each row. I know that rowMeans() can be used and will work just fine, however, I am trying to prove that a for loop can be used as well. How would I accomplish this?
X <- matrix(1:100, nrow = 25, ncol = 4)
for (i in 1:n) {
nums[i] <- mean(X)
}
print(nums[i])
[1] 50.5
VS.
rowMeans(X)
[1] 38.5 39.5 40.5 41.5 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5 58.5 59.5 60.5 61.5 62.5
Just turning #David's comment into an answer.
row_m <- vector("list", nrow(X))
for (i in 1:nrow(X)) {
row_m[i]<- mean(X[i,])
}
unlist(row_m)
# [1] 38.5 39.5 40.5 41.5 42.5 43.5 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5
# [15] 52.5 53.5 54.5 55.5 56.5 57.5 58.5 59.5 60.5 61.5 62.5
I have raw data shown below. I'm trying to move a row of data that corresponds to a label it matches to a new location in the dataframe.
dat<-read.table(text='RowLabels col1 col2 col3 col4 col5 col6
L 24363.7 25944.9 25646.1 25335.4 23564.2 25411.5
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
181920 652.9 729.2 652.1 689.1 612.5 738.4
1 104.3 107.3 103.5 104.2 98.3 110.1
2 103.6 102.6 100.1 103.2 88.8 117.7
3 53.5 99.1 46.7 70.3 53.9 32.5
4 93.5 107.2 98.3 99.3 97.3 121.1
5 96.8 109.3 104 102.2 98.7 112.9
6 103.6 96.9 104.7 104.4 91.5 137.7
7 97.6 106.8 94.8 105.5 84 106.4
181930 732.1 709.6 725.8 729.5 554.5 873.1
1 118.4 98.8 102.3 102 101.9 115.8
2 96.7 103.3 104.6 105.2 81.9 128.7
3 96 98.2 99.4 97.9 69.8 120.6
4 100.7 101 103.6 106.6 59.6 136.2
5 106.1 103.4 104.7 104.8 76.1 131.8
6 105 102.1 103 108.3 81 124.7
7 109.2 102.8 108.2 104.7 84.2 115.3
N 3836.4 4395.8 4227.3 4567.4 4009.9 4434.6
610 88.1 96.3 99.6 92 90 137.6
1 88.1 96.3 99.6 92 90 137.6
181920 113.1 100.6 106.5 104.2 87.3 108.2
1 113.1 100.6 106.5 104.2 87.3 108.2
181930 111.3 99.1 104.5 115.5 103.6 118.8
1 111.3 99.1 104.5 115.5 103.6 118.8
',header=TRUE)
I want to match the values of the three N-prefix labels: 610, 181920 and 181930 with its corresponding L-prefix labels. Basically move that row of data into the L-prefix as a new row, labeled 0 or 8 for example. So, the result for label, 610 would look like:
RowLabels col1 col2 col3 col4 col5 col6
610 411.4 439 437.3 436.9 420.7 516.9
1 86.4 113.9 103.5 113.5 80.3 129
2 102.1 99.5 96.3 100.4 99.5 86
3 109.7 102.2 100.2 112.9 92.3 123.8
4 88.9 87.1 103.6 102.5 93.6 134.1
5 -50.3 -40.2 -72.3 -61.4 -27 -22.7
6 -35.3 -9.3 25.3 -0.3 15.6 -27.3
7 109.9 85.8 80.7 69.3 66.4 94
8 88.1 96.3 99.6 92 90 137.6
Is this possible? I tried searching and I found some resources pointing toward dplyr or tidyr or aggregate. But I can't find a good example that matches my case. How to combine rows based on unique values in R? and
Aggregate rows by shared values in a variable
library(dplyr)
library(zoo)
df <- dat %>%
filter(grepl("^\\d+$",RowLabels)) %>%
mutate(RowLabels_temp = ifelse(grepl("^\\d{3,}$",RowLabels), as.numeric(as.character(RowLabels)), NA)) %>%
na.locf() %>%
select(-RowLabels) %>%
distinct() %>%
group_by(RowLabels_temp) %>%
mutate(RowLabels_indexed = row_number()-1) %>%
arrange(RowLabels_temp, RowLabels_indexed) %>%
mutate(RowLabels_indexed = ifelse(RowLabels_indexed==0, RowLabels_temp, RowLabels_indexed)) %>%
rename(RowLabels=RowLabels_indexed) %>%
data.frame()
df <- df %>% select(-RowLabels_temp)
df
Output is
col1 col2 col3 col4 col5 col6 RowLabels
1 411.4 439.0 437.3 436.9 420.7 516.9 610
2 86.4 113.9 103.5 113.5 80.3 129.0 1
3 102.1 99.5 96.3 100.4 99.5 86.0 2
4 109.7 102.2 100.2 112.9 92.3 123.8 3
5 88.9 87.1 103.6 102.5 93.6 134.1 4
6 -50.3 -40.2 -72.3 -61.4 -27.0 -22.7 5
7 -35.3 -9.3 25.3 -0.3 15.6 -27.3 6
8 109.9 85.8 80.7 69.3 66.4 94.0 7
9 88.1 96.3 99.6 92.0 90.0 137.6 8
...
It sounds like you want to use the match() function, for example:
target<-c(the values of your target order)
df<-df[match(target, df$column_to_reorder),]
http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504
This is the website from with I want to read data.
My code is as follows,
library(XML)
fileurl <- "http://www.aqistudy.cn/historydata/daydata.php?city=苏州&month=201404"
doc <- htmlTreeParse(fileurl, useInternalNodes = TRUE, encoding = "utf-8")
rootnode <- xmlRoot(doc)
pollution <- xpathSApply(rootnode, "/td", xmlValue)
But I got a lot of messy code, and I don't know how to fix this problem.
I appreciate for any help!
This can be simplified using library(rvest) to directly read the table
library(rvest)
url <- "http://www.aqistudy.cn/historydata/daydata.php?city=%E8%8B%8F%E5%B7%9E&month=201504"
doc <- read_html(url) %>%
html_table()
doc[[1]]
# 日期 AQI 范围 质量等级 PM2.5 PM10 SO2 CO NO2 O3 排名
# 1 2015-04-01 106 67~144 轻度污染 79.3 105.1 20.2 1.230 89.5 76 308
# 2 2015-04-02 74 31~140 良 48.1 79.7 18.8 1.066 51.5 129 231
# 3 2015-04-03 98 49~136 良 72.9 89.2 16.0 1.323 50.9 62 293
# 4 2015-04-04 92 56~158 良 67.6 78.2 14.3 1.506 57.4 93 262
# 5 2015-04-05 87 42~167 良 63.7 56.1 16.9 1.245 50.8 91 215
# 6 2015-04-06 46 36~56 优 29.1 30.8 10.0 0.817 37.5 98 136
# 7 2015-04-07 45 34~59 优 27.0 42.4 12.0 0.640 36.6 77 143
spouse i have a vector tmp of size 100
i want to know where there is for example an average of 10 between
each 4 elements.
i.e
i want to know which of these: mean(tmp[c(1,2,3,4)]),mean(tmp[c(2,3,4,5)]),mean(tmp[c(3,4,5,6)])..and so on...mean(tmp[c(97,98,99,100)])
are larger then 10
how can i do it not in a loop?
(loop takes too long since i have a table of 500000 rows by 60 col)
and more not only avg but also difference or sum and so on...
i have tried splitting rows as such
tmp<-seq(1,100,1)
one<-seq(1,97,1)
two<-seq(2,98,1)
tree<-seq(3,99,1)
four<-seq(4,100,1)
aa<-(tmp[one]+tmp[two]+tmp[tree]+tmp[four])/4
which(aa>10)
its working but its not rational to do it if you want for example avg of 12
here is an example of what i do to be clear
b12<-seq(1,988,1)
b11<-seq(2,989,1)
b10<-seq(3, 990,1)
b9<-seq(4,991,1)
b8<-seq(5,992,1)
b7<-seq(6,993,1)
b6<-seq(7,994,1)
b5<-seq(8, 995,1)
b4<-seq(9,996,1)
b3<-seq(10,997,1)
b2<-seq(11,998,1)
b1<-seq(12,999,1)
now<-seq(13, 1000,1)
po<-rpois(1000,4)
nor<-rnorm(1000,5,0.2)
uni<-runif(1000,10,75)
chis<-rchisq(1000,3,0)
which((po[now]/nor[now])>1 & (nor[b12]/nor[now])>1 &
((po[now]/po[b4])>1 | (uni[now]-uni[b4])>=0) &
((chis[now]+chis[b1]+chis[b2]+chis[b3])/4)>2 &
(uni[now]/max(uni[b1],uni[b2],uni[b3],uni[b4],
uni[b5],uni[b6],uni[b7],uni[b8]))>0.5)+12
this code give me the exact index in the real table
that mach all the conditions
and i have 58 vars with 550000 rows
thank you
The question is not very clear. Based on the wording, I guess, this should help:
n <- 100
res <- sapply(1:(n-3), function(i) mean(tmp[i:(i+3)]))
which(res >10)
Also,
m1 <- matrix(tmp[1:4+ rep(0:96,each=4)],ncol=4,byrow=T)
which(rowMeans(m1) >10)
Maybe you should look at the rollapply function from the "zoo" package. You would need to adjust the width argument according to your specific needs.
library(zoo)
tmp <- seq(1, 100, 1)
rollapply(tmp, width = 4, FUN = mean)
# [1] 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5 11.5 12.5 13.5 14.5 15.5
# [15] 16.5 17.5 18.5 19.5 20.5 21.5 22.5 23.5 24.5 25.5 26.5 27.5 28.5 29.5
# [29] 30.5 31.5 32.5 33.5 34.5 35.5 36.5 37.5 38.5 39.5 40.5 41.5 42.5 43.5
# [43] 44.5 45.5 46.5 47.5 48.5 49.5 50.5 51.5 52.5 53.5 54.5 55.5 56.5 57.5
# [57] 58.5 59.5 60.5 61.5 62.5 63.5 64.5 65.5 66.5 67.5 68.5 69.5 70.5 71.5
# [71] 72.5 73.5 74.5 75.5 76.5 77.5 78.5 79.5 80.5 81.5 82.5 83.5 84.5 85.5
# [85] 86.5 87.5 88.5 89.5 90.5 91.5 92.5 93.5 94.5 95.5 96.5 97.5 98.5
So, to get the details you want:
aa <- rollapply(tmp, width = 4, FUN = mean)
which(aa > 10)