how can I categorize based on several column [duplicate]

This question already has answers here:
How to reshape data from long to wide format
(14 answers)
Closed 4 years ago.
I have data like this:
df<- structure(list(V1 = structure(c(10L, 4L, 7L, 5L, 3L, 1L, 8L,
11L, 12L, 9L, 2L, 6L), .Label = c("BRA_AC_A6IX", "BRA_BH_A18F",
"BRA_BH_A18V", "BRA_BH_A1ES", "BRA_BH_A1FE", "BRA_BH_A6R8", "BRA_E2_A15A",
"BRA_E2_A15K", "BRA_E2_A1B4", "BRA_EM_A15E", "BRA_LQ_A4E4", "BRA_OK_A5Q2"
), class = "factor"), V2 = structure(c(2L, 3L, 5L, 3L, 3L, 5L,
3L, 4L, 1L, 4L, 2L, 2L), .Label = c("Level ii", "Level iia",
"Level iib", "Level iiia", "Level iiic"), class = "factor"),
V3 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
3L, 4L), .Label = c("amira", "boro", "car", "dim"), class = "factor")), class = "data.frame", row.names = c(NA,
-12L))
I am trying to categorize the rows based on two columns.
I can do the following:
library(dplyr)
df %>%
  group_by(V2) %>%
  summarise(no_rows = length(V2))
# A tibble: 5 x 2
V2 no_rows
<fct> <int>
1 Level ii 1
2 Level iia 3
3 Level iib 4
4 Level iiia 2
5 Level iiic 2
But I want output like this:
           Amira Boro Car dim
Level ii              1
Level iia  1          1   1
Level iib  1     1    1
Level iiia            1
Level iiic 1     1

How about
library(reshape2)
df1 <- df[,-1]
table(melt(df1, id.var="V2")[-2])

Here is a tidyverse method. I am assuming that you actually want the counts; if you only want presence/absence, that is easy to add.
df <- structure(list(V1 = structure(c(10L, 4L, 7L, 5L, 3L, 1L, 8L, 11L, 12L, 9L, 2L, 6L), .Label = c("BRA_AC_A6IX", "BRA_BH_A18F", "BRA_BH_A18V", "BRA_BH_A1ES", "BRA_BH_A1FE", "BRA_BH_A6R8", "BRA_E2_A15A", "BRA_E2_A15K", "BRA_E2_A1B4", "BRA_EM_A15E", "BRA_LQ_A4E4", "BRA_OK_A5Q2"), class = "factor"), V2 = structure(c(2L, 3L, 5L, 3L, 3L, 5L, 3L, 4L, 1L, 4L, 2L, 2L), .Label = c("Level ii", "Level iia", "Level iib", "Level iiia", "Level iiic"), class = "factor"), V3 = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 4L), .Label = c("amira", "boro", "car", "dim"), class = "factor")), class = "data.frame", row.names = c(NA, -12L))
library(tidyverse)
df %>%
select(-V1) %>%
count(V2, V3) %>%
spread(V3, n, fill = 0L)
#> # A tibble: 5 x 5
#> V2 amira boro car dim
#> <fct> <int> <int> <int> <int>
#> 1 Level ii 0 0 1 0
#> 2 Level iia 1 0 1 1
#> 3 Level iib 1 2 1 0
#> 4 Level iiia 0 0 2 0
#> 5 Level iiic 1 1 0 0
Created on 2018-05-23 by the reprex package (v0.2.0).
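For the presence/absence variant mentioned above, a minimal base R sketch could look like the following. It assumes only the V2 and V3 columns matter (their values are copied from the question's data), and the names counts and presence are made up for illustration.

```r
# V2/V3 values copied from the question's data; V1 is not needed here
df <- data.frame(
  V2 = c("Level iia", "Level iib", "Level iiic", "Level iib", "Level iib",
         "Level iiic", "Level iib", "Level iiia", "Level ii", "Level iiia",
         "Level iia", "Level iia"),
  V3 = c("amira", "amira", "amira", "boro", "boro", "boro",
         "car", "car", "car", "car", "car", "dim")
)

# Cross-tabulate the two columns to get counts per (V2, V3) pair
counts <- table(df$V2, df$V3)

# Turn counts into 1/0 presence flags (assumption: any count > 0 means "present")
presence <- (counts > 0) * 1L
presence
```

Cells with a count of 2, such as Level iib / boro, collapse to 1, which matches the layout requested in the question.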

Related

Calculating and looping summaries for individual participants into a table

I have data from several hundred participants who each provided between 1 and 6 sentences. They then rated their sentence(s) on 4 dimensions, as did two external raters.
I'd like to create a table, grouped by participant, with columns showing these values:
Participants' rate of agreement with rater 1 (par1), with rater 2 (par2) and overall (paro)
Participants' rate of agreement for each dimension with rater 1 (pad1.1, pad2.1 etc.), with rater 2 (pad1.2, pad2.2 etc.) and overall (pad1.o, pad2.o etc.)
Mean difference in rating between participant and rater 1 (mdrp1), rater 2 (mdrp2) and both raters (mdrpo)
Mean difference in rating for each dimension between participant and rater 1 (mdr1p1, mdr2p1 etc.), rater 2 (mdr1p2, mdr2p2 etc.) and both raters (mdr1po, mdr2po etc.)
(So with 4 dimensions there should be 30 values per participant)
Due to the size and structure of the data, I'm not sure where to start on this. I'm guessing that a loop would be necessary, but I've struggled to get my head around how to do that as well.
For agreement I'm considering adding TRUE/FALSE variables and then replacing them with 1 and 0 to eventually calculate agreement:
df <- df %>% mutate(par1 = (df$d1 == df$r1.1))
df <- df %>% mutate(par2 = (df$d1 == df$r2.1))
df <- df %>% mutate(paro = (df$d1 == df$r1.1 & df$d1 == df$r2.1))
And similarly for mean differences, adding variables with rating difference for each dimension...
df <- df %>% mutate(mdr1p1 = (df$d1 - df$r1.1))
df <- df %>% mutate(mdr1p2 = (df$d1 - df$r2.1))
df <- df %>% mutate(mdr1po = (df$d1 - ((df$r1.1 + df$r2.1)/2)))
...But these seem to be quite inefficient approaches!
My data looks like this:
ID Ans d1 d2 d3 d4 r1.1 r1.2 r1.3 r1.4 r2.1 r2.2 r2.3 r2.4
1 53 abc 3 3 3 3 3 2 4 3 3 2 4 3
2 a4 def 3 3 3 3 3 1 2 3 3 1 3 3
3 a4 ghi 4 4 4 4 3 2 5 1 3 1 5 2
4 hj jkl 3 3 3 3 3 1 3 3 3 1 5 3
5 32 mno 2 3 3 3 3 1 3 2 3 1 3 3
6 32 pqr 3 3 3 2 3 2 5 3 4 2 3 3
ID = participant
Ans = participants' written answer
d = dimension rated by participant
r1 = dimensions rated by external rater 1
r2 = dimensions rated by external rater 2
Example data:
structure(list(ID = c(1L, 2L, 2L, 3L, 4L, 4L, 5L),
Ans = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu"),
d1 = c(3L, 3L, 4L, 3L, 2L, 3L, 3L), d2 = c(3L, 3L, 4L, 3L, 3L, 3L, 1L),
d3 = c(3L, 3L, 4L, 3L, 3L, 3L, 1L), d4 = c(3L, 3L, 4L, 3L, 3L, 2L, 3L),
r1.1 = c(3L, 3L, 3L, 3L, 3L, 3L, 3L), r1.2 = c(2L, 1L, 2L, 1L, 1L, 2L, 3L),
r1.3 = c(4L, 2L, 5L, 3L, 3L, 5L, 3L), r1.4 = c(3L, 3L, 1L, 3L, 2L, 3L, 2L),
r2.1 = c(3L, 3L, 3L, 3L, 3L, 4L, 3L), r2.2 = c(2L, 1L, 1L, 1L, 1L, 2L, 1L),
r2.3 = c(4L, 3L, 5L, 5L, 3L, 3L, 5L), r2.4 = c(3L, 3L, 2L, 3L, 3L, 3L, 2L)),
row.names = c(NA, -7L), class = "data.frame")
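One possible starting point that avoids an explicit loop is to compute row-level comparisons first and then average them per participant. The sketch below covers only dimension 1 against rater 1, uses a cut-down copy of the example data, and the helper column names agree1 and diff1 are hypothetical (they correspond roughly to pad1.1 and mdr1p1 in the description above).

```r
# Cut-down copy of the example data: only the columns needed for dimension 1
df <- data.frame(
  ID   = c(1L, 2L, 2L, 3L, 4L, 4L, 5L),
  d1   = c(3L, 3L, 4L, 3L, 2L, 3L, 3L),
  r1.1 = c(3L, 3L, 3L, 3L, 3L, 3L, 3L)
)

# Row-level agreement (TRUE/FALSE) and rating difference for dimension 1 vs rater 1
df$agree1 <- df$d1 == df$r1.1   # hypothetical helper column
df$diff1  <- df$d1 - df$r1.1    # hypothetical helper column

# Average both per participant; the mean of TRUE/FALSE is the agreement rate
res <- aggregate(cbind(agree1, diff1) ~ ID, data = df, FUN = mean)
res
```

The same pattern could be repeated for the other dimensions and for rater 2; whether reshaping the data to long format first would be cleaner depends on how the 30 output columns are meant to be used.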

how to select specific row by a column

I have some data; an example is shown below:
a = rep(1:5, each=3)
b = rep(c("a","b","c","a","c"), each = 3)
df = data.frame(a,b)
I want to select all the rows that have "a" in column b.
I tried to do it with
df[df$a %in% a,]
Can someone give me an idea how to get them out?
df2<- structure(list(V1 = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), V2 = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 1L, 2L, 3L, 4L,
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L), .Label = c("B02", "B03",
"B04", "B05", "B06", "B07", "C02", "C03", "C04", "C05", "C06",
"C07"), class = "factor")), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA,
-24L))
I want to select specific rows whose V2 starts with B, but not all of them: just B02, B03, B04 and B05
1 B02
1 B03
1 B04
1 B05
2 B02
2 B03
2 B04
2 B05
I also want to have the original data without those rows.

We need to check the 'b' column
df[df$b %in% 'a',]
For the updated question with 'df2', we can use paste to create the strings 'B02' to 'B05' and use %in% to subset
df2[df2$V2 %in% paste0("B0", 2:5),]
Or another option is grep
df2[grep("^B0[2-5]$", df2$V2),]
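To also get the original data without those rows, one option is to store the logical index once and negate it. This is a small sketch on a rebuilt copy of df2 (same values as in the question); the names keep, selected and rest are only illustrative.

```r
# Rebuild df2 with the same values as in the question
df2 <- data.frame(
  V1 = rep(1:2, each = 12),
  V2 = factor(rep(c(paste0("B0", 2:7), paste0("C0", 2:7)), times = 2))
)

# Logical index: TRUE for rows whose V2 is one of B02..B05
keep <- df2$V2 %in% paste0("B0", 2:5)

selected <- df2[keep, ]    # the rows asked for
rest     <- df2[!keep, ]   # everything else
```

Negating a logical vector is safer here than negating the result of grep, because a negative index built from an empty match would drop every row.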

> df
a b
1 1 a
2 1 a
3 1 a
4 2 b
5 2 b
6 2 b
7 3 c
8 3 c
9 3 c
10 4 a
11 4 a
12 4 a
13 5 c
14 5 c
15 5 c
The following basically says:
keep all columns of df, but only the rows where column b is equal to 'a'
> rows_with_a<-df[df$b=='a', ]
> rows_with_a
a b
1 1 a
2 1 a
3 1 a
10 4 a
11 4 a
12 4 a

ggplot2 - how to create a clustered timeline?

How would you go about creating the graph below in R? I want to show the duration of different treatments for different patients.
Mock data here:
Patient    Drug    Start Day  Stop Day
Patient 1  Drug 1  1          3
Patient 1  Drug 2  2          5
Patient 1  Drug 3  3          8
Patient 2  Drug 1  2          4
Patient 2  Drug 2  2          5
Patient 2  Drug 3  1          6
Patient 3  Drug 1  4          7
Patient 3  Drug 2  3          8
Patient 3  Drug 3  5          6

Your graph can be generated using geom_segment in the ggplot2 package:
df <- structure(list(Patient = structure(c(1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L, 3L), .Label = c("Patient1", "Patient2", "Patient3"), class = "factor"),
Drug = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L), .Label = c("Drug1",
"Drug2", "Drug3"), class = "factor"), StartDay = c(1L, 2L,
3L, 2L, 2L, 1L, 4L, 3L, 5L), StopDay = c(3L, 5L, 8L, 4L,
5L, 6L, 7L, 8L, 6L)), .Names = c("Patient", "Drug", "StartDay",
"StopDay"), class = "data.frame", row.names = c(NA, -9L))
df$Drug <- factor(df$Drug, levels(df$Drug)[c(3,2,1)])
library(ggplot2)
ggplot(data=df, aes(color=Drug))+
geom_segment(aes(x=StartDay, xend=StopDay, y=Drug, yend=Drug),lwd=12)+
facet_grid(Patient~.)+xlab("Days")

Get sum of unique rows in table function in R

Suppose I have data which looks like this
Id Name Price sales Profit Month Category Mode Supplier
1 A 2 5 8 1 X K John
1 A 2 6 9 2 X K John
1 A 2 5 8 3 X K John
2 B 2 4 6 1 X L Sam
2 B 2 3 4 2 X L Sam
2 B 2 5 7 3 X L Sam
3 C 2 5 11 1 X M John
3 C 2 5 11 2 X L John
3 C 2 5 11 3 X K John
4 D 2 8 10 1 Y M John
4 D 2 8 10 2 Y K John
4 D 2 5 7 3 Y K John
5 E 2 5 9 1 Y M Sam
5 E 2 5 9 2 Y L Sam
5 E 2 5 9 3 Y M Sam
6 F 2 4 7 1 Z M Kyle
6 F 2 5 8 2 Z L Kyle
6 F 2 5 8 3 Z M Kyle
If I apply the table function to Category and Mode, it just counts all the rows, and the result is
K L M
X 4 4 1
Y 2 1 3
Z 0 1 2
Now what if I do not want the count over all rows, but only over the rows that are unique per Id,
so that it looks like
K L M
X 2 2 1
Y 1 1 2
Z 0 1 1
Thanks

If df is your data.frame:
# Subset original data.frame to keep columns of interest
df1 <- df[,c("Id", "Category", "Mode")]
# Remove duplicated rows
df1 <- df1[!duplicated(df1),]
# Create table
with(df1, table(Category, Mode))
# Mode
# Category K L M
# X 2 2 1
# Y 1 1 2
# Z 0 1 1
Or in one line using unique
table(unique(df[c("Id", "Category", "Mode")])[-1])
df <- structure(list(Id = c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), Name = structure(c(1L, 1L, 1L,
2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 6L), .Label = c("A",
"B", "C", "D", "E", "F"), class = "factor"), Price = c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L
), sales = c(5L, 6L, 5L, 4L, 3L, 5L, 5L, 5L, 5L, 8L, 8L, 5L,
5L, 5L, 5L, 4L, 5L, 5L), Profit = c(8L, 9L, 8L, 6L, 4L, 7L, 11L,
11L, 11L, 10L, 10L, 7L, 9L, 9L, 9L, 7L, 8L, 8L), Month = c(1L,
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L,
3L), Category = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("X", "Y", "Z"
), class = "factor"), Mode = structure(c(1L, 1L, 1L, 2L, 2L,
2L, 3L, 2L, 1L, 3L, 1L, 1L, 3L, 2L, 3L, 3L, 2L, 3L), .Label = c("K",
"L", "M"), class = "factor"), Supplier = structure(c(1L, 1L,
1L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 2L, 2L, 2L
), .Label = c("John", "Kyle", "Sam"), class = "factor")), .Names = c("Id",
"Name", "Price", "sales", "Profit", "Month", "Category", "Mode",
"Supplier"), class = "data.frame", row.names = c(NA, -18L))

We can try
library(data.table)
dcast(unique(setDT(df1[c('Category', 'Mode', 'Id')])),
Category~Mode, value.var='Id', length)
# Category K L M
#1: X 2 2 1
#2: Y 1 1 2
#3: Z 0 1 1
Or with dplyr
library(dplyr)
library(tidyr)  # spread() comes from tidyr
df1 %>%
distinct(Id, Category, Mode) %>%
group_by(Category, Mode) %>%
tally() %>%
spread(Mode, n, fill=0)
# Category K L M
# (chr) (dbl) (dbl) (dbl)
#1 X 2 2 1
#2 Y 1 1 2
#3 Z 0 1 1
Or as @David Arenburg suggested, a variant of the above is
df1 %>%
distinct(Id, Category, Mode) %>%
select(Category, Mode) %>%
table()

Replace NA values in dataframe variable with values from other dataframe by "ID"

I would like to know if there is a more concise way to replace NA values for a variable in a dataframe than what I did below. The code below seems to be longer than what I think might be possible in R. For example, I am unaware of some package/tool that might do this more succinctly.
Is there a way to replace, or merge values only if they are NA? After merging two dataframes using all.x = T I have some NA values, I'd like to replace those with information from another dataframe using a common variable to link the replacement.
# get dataframes
breaks <- structure(list(Break = 1:11, Value = c(2L, 13L, 7L, 9L, 40L,
21L, 10L, 37L, 7L, 26L, 42L)), .Names = c("Break", "Value"), class = "data.frame", row.names = c(NA,
-11L))
fsites <- structure(list(Site = c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L,
3L, 3L, 3L, 3L), Plot = c(0L, 1L, 2L, 3L, 4L, 0L, 1L, 2L, 0L,
1L, 2L, 3L, 4L, 5L), Break = c(1L, 5L, 7L, 8L, 11L, 1L, 6L, 11L,
1L, 4L, 6L, 8L, 9L, 11L)), .Names = c("Site", "Plot", "Break"
), class = "data.frame", row.names = c(NA, -14L))
bps <- structure(list(Site = c(1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 3L,
3L), Plot = c(0L, 1L, 2L, 3L, 1L, 2L, 0L, 1L, 2L, 3L, 4L), Value = c(0.393309653,
0.12465733, 0.27380161, 0.027288989, 0.439712533, 0.289724079,
0.036429062, 0.577460008, 0.820375917, 0.323217357, 0.28637503
)), .Names = c("Site", "Plot", "Value"), class = "data.frame", row.names = c(NA,
-11L))
# merge fsites and bps
df1 <- merge(fsites, bps, by=c("Site", "Plot"), all.x=T)
# merge df1 and breaks to get values to eventually replace the NA values in
# df1$Value (Value.x after the merge), here "Break" is the ID by which to replace the NA values
df2 <- merge(df1, breaks, by=c("Break"))
# Create a new column 'Value' that uses Value.x, unless NA, then Value.y
df3 <- df2
df3$Value <- df2$Value.x
df2.na <- is.na(df2$Value.x)
df3$Value[df2.na] <- df2$Value.y[df2.na]
# get rid of unnecessary columns
cols <- c(1:3,6)
df4 <- df3[,cols]

At the stage where there is only (breaks, fsites, bps and) df1 around:
df1$Value <- ifelse(is.na(df1$Value),
breaks$Value[match(df1$Break, breaks$Break)], df1$Value)
#> df1
# Site Plot Break Value
#1 1 0 1 0.39330965
#2 1 1 5 0.12465733
#3 1 2 7 0.27380161
#4 1 3 8 0.02728899
#5 1 4 11 42.00000000
#6 2 0 1 2.00000000
#7 2 1 6 0.43971253
#8 2 2 11 0.28972408
#9 3 0 1 0.03642906
#10 3 1 4 0.57746001
#11 3 2 6 0.82037592
#12 3 3 8 0.32321736
#13 3 4 9 0.28637503
#14 3 5 11 42.00000000
#just to test with your `df4`
> sort(df1$Value) == sort(df4$Value)
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
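The match() lookup used above may be easier to follow on a toy example. The data frames lookup and x below are invented for illustration and are not part of the question's data.

```r
# Toy lookup table and a data frame with one missing Value
lookup <- data.frame(Break = 1:3, Value = c(10, 20, 30))
x <- data.frame(Break = c(3, 1, 3), Value = c(NA, 0.5, 0.7))

# For each row of x, find the position of its Break in the lookup table
idx <- match(x$Break, lookup$Break)

# Keep existing values; fill only the NA entries from the lookup table
x$Value <- ifelse(is.na(x$Value), lookup$Value[idx], x$Value)
x
```

Only the first row is filled in (with the lookup value for Break 3); the other two rows keep their original values.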
