I have the following data frame, oridf:
test_name gp1_0month gp2_0month gp1_1month gp2_1month gp1_3month gp2_3month
Test_1 136 137 152 143 156 150
Test_2 130 129 81 78 86 80
Test_3 129 128 68 68 74 71
Test_4 40 40 45 43 47 46
Test_5 203 201 141 134 149 142
Test_6 170 166 134 116 139 125
oridf <- structure(list(test_name = structure(1:6, .Label = c("Test_1",
"Test_2", "Test_3", "Test_4", "Test_5", "Test_6"), class = "factor"),
gp1_0month = c(136L, 130L, 129L, 40L, 203L, 170L), gp2_0month = c(137L,
129L, 128L, 40L, 201L, 166L), gp1_1month = c(152L, 81L, 68L,
45L, 141L, 134L), gp2_1month = c(143L, 78L, 68L, 43L, 134L,
116L), gp1_3month = c(156L, 86L, 74L, 47L, 149L, 139L), gp2_3month = c(150L,
80L, 71L, 46L, 142L, 125L)), .Names = c("test_name", "gp1_0month",
"gp2_0month", "gp1_1month", "gp2_1month", "gp1_3month", "gp2_3month"
), class = "data.frame", row.names = c(NA, -6L))
I need to convert it to the following format:
test_name month group value
Test_1 0 gp1 136
Test_1 0 gp2 137
Test_1 1 gp1 152
Test_1 1 gp2 143
.....
Hence, the conversion would involve splitting gp1, 0month, etc. out of the names of columns 2:7 of the original data frame oridf, so that I can plot it with the following command:
qplot(data=newdf, x=month, y=value, geom=c("point","line"), color=test_name, linetype=group)
How can I convert these data? I tried the melt command, but I cannot combine it with the strsplit command.
First, I would use melt, as you had done:
library(reshape2)
mm <- melt(oridf, id.vars = "test_name")
Then there is also a colsplit function in the reshape2 library. Here we use it on the variable column to split at the underscore and at the "m" of "month" (ignoring the rest):
info <- colsplit(mm$variable, "(_|m)", c("group","month", "xx"))[,-3]
Then we can recombine the data
newdf <- cbind(mm[, 1, drop = FALSE], info, mm[, 3, drop = FALSE])
# head(newdf)
# test_name group month value
# 1 Test_1 gp1 0 136
# 2 Test_2 gp1 0 130
# 3 Test_3 gp1 0 129
# 4 Test_4 gp1 0 40
# 5 Test_5 gp1 0 203
# 6 Test_6 gp1 0 170
And we can plot it using the qplot command you supplied above.
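If you would rather stay close to the strsplit idea from the question, the same split can be done by hand on the melted variable column. A minimal base-R sketch (using a toy two-column version of oridf for brevity):

```r
# Melt by hand with stack(), then strsplit() the former column names
wide <- data.frame(test_name = c("Test_1", "Test_2"),
                   gp1_0month = c(136L, 130L),
                   gp2_0month = c(137L, 129L))
long <- cbind(test_name = rep(wide$test_name, 2),
              stack(wide, select = -test_name))
# "gp1_0month" -> c("gp1", "0month")
parts <- strsplit(as.character(long$ind), "_", fixed = TRUE)
long$group <- vapply(parts, `[`, character(1), 1)
long$month <- as.numeric(sub("month", "", vapply(parts, `[`, character(1), 2)))
long <- long[, c("test_name", "month", "group", "values")]
```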
Use gather from the tidyr package to convert from wide to long, and then use separate from the same package to split the group_month column into group and month columns. Finally, using mutate from dplyr and extract_numeric from tidyr, extract the numeric part of month. (Note: extract_numeric has since been deprecated in tidyr; readr::parse_number is the current replacement.)
library(dplyr)
# devtools::install_github("hadley/tidyr")
library(tidyr)
newdf <- oridf %>%
gather(group_month, value, -test_name) %>%
separate(group_month, into = c("group", "month")) %>%
mutate(month = extract_numeric(month))
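If you want to avoid the deprecated extract_numeric, the numeric part can also be pulled out with a plain regex (readr::parse_number is the tidyverse equivalent). A small sketch:

```r
# Strip everything that is not a digit, sign, or decimal point
month <- c("0month", "1month", "3month")
nums <- as.numeric(gsub("[^0-9.-]", "", month))
nums  # 0 1 3
```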
I have a very large datatable of ocean sites and multiple depths at each site (table 1).
I need to extract rows matching site location and depth in another datatable (Table 2).
Table 1 - table to be subset

Lat  Long  Depth  Nitrate
165   -77      0  29.5420
165   -77     50  30.2213
165   -77    100  29.2275
124   -46      0  27.8544
124   -46     50  28.6458
124   -46    100  24.9543
 76   -24      0  31.9784
 76   -24     50  28.6408
 76   -24    100  24.9746
 25   -62      0  31.9784
 25   -62     50  28.6408
 25   -62    100  24.9746
Table 2 - co-ordinates and depth needed for subsetting:

Lat  Long  Depth
165   -77    100
 76   -24     50
 25   -62      0
I have tried to get all sites in a table that would include all available depth data for those sites:
subset <- filter(table1, Lat == table2$Lat | Long == table2$Long)
but it returns zero obs.
Any suggestions?
It seems you are looking for an inner join:
merge(dat1, dat2, by = c("Lat", "Long"))
# Lat Long Depth.x Nitrate Depth.y
# 1 165 -77 100 29.2275 100
# 2 165 -77 0 29.5420 100
# 3 165 -77 50 30.2213 100
# 4 25 -62 0 31.9784 0
# 5 25 -62 50 28.6408 0
# 6 25 -62 100 24.9746 0
# 7 76 -24 0 31.9784 50
# 8 76 -24 50 28.6408 50
# 9 76 -24 100 24.9746 50
There is some risk in this: joins rely on strict equality when comparing columns, but floating-point values (with many digits of precision) can differ by amounts too small to see when printed yet large enough to break equality (c.f., Why are these numbers not equal?, Is floating point math broken?, and https://en.wikipedia.org/wiki/IEEE_754). The insidious part is that you will get no errors; the join simply produces no matches.
To work around that problem, you need to think about "tolerance": what distance between a point in dat1 and a point in dat2 is "close enough" to constitute a match. That can be done in one of two ways: (1) calculate the distance between every point in dat1 and every point in dat2, and take the minimum for each point in dat1 (subject to the tolerance); or (2) do a "fuzzy join" (using the aptly named fuzzyjoin package) to find points that are within a range (effectively dat1$Lat between dat2$Lat +/- 0.01, and similarly for Long).
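The pitfall is easy to demonstrate: two key values that print identically can still fail an exact-equality join. A small sketch (rounding both keys to a fixed number of decimals is the crudest tolerance fix):

```r
a <- data.frame(Lat = 0.1 + 0.2, x = 1)
b <- data.frame(Lat = 0.3,       y = 2)
nrow(merge(a, b, by = "Lat"))   # 0 rows: 0.1 + 0.2 != 0.3 exactly
nrow(merge(transform(a, Lat = round(Lat, 6)),
           transform(b, Lat = round(Lat, 6)),
           by = "Lat"))         # 1 row after rounding both keys
```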
Data
dat1 <- structure(list(Lat = c(165L, 165L, 165L, 124L, 124L, 124L, 76L, 76L, 76L, 25L, 25L, 25L), Long = c(-77L, -77L, -77L, -46L, -46L, -46L, -24L, -24L, -24L, -62L, -62L, -62L), Depth = c(0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L, 0L, 50L, 100L), Nitrate = c(29.542, 30.2213, 29.2275, 27.8544, 28.6458, 24.9543, 31.9784, 28.6408, 24.9746, 31.9784, 28.6408, 24.9746)), class = "data.frame", row.names = c(NA, -12L))
dat2 <- structure(list(Lat = c(165L, 76L, 25L), Long = c(-77L, -24L, -62L), Depth = c(100L, 50L, 0L)), class = "data.frame", row.names = c(NA, -3L))
Paste the Lat and Long columns of both tables together and select the rows that match:
result <- subset(table1, paste(Lat, Long) %in% paste(table2$Lat, table2$Long))
result
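Applied to a cut-down version of the tables above (just to show the mechanics; the full data works the same way):

```r
table1 <- data.frame(Lat  = c(165, 165, 124),
                     Long = c(-77, -77, -46),
                     Depth = c(0, 50, 0),
                     Nitrate = c(29.5420, 30.2213, 27.8544))
table2 <- data.frame(Lat = 165, Long = -77, Depth = 100)
# keeps both 165/-77 rows, regardless of Depth
res <- subset(table1, paste(Lat, Long) %in% paste(table2$Lat, table2$Long))
res
```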
I've done a self-paced reading experiment in which 151 participants read 112 sentences divided into three lists and I'm having some problems cleaning the data in R. I'm not a programmer so I'm kind of struggling with all this!
I've got the results file which looks something like this:
results
part item word n.word rt
51 106 * 1 382
51 106 El 2 286
51 106 asistente 3 327
51 106 del 4 344
51 106 carnicero 5 394
51 106 que 6 274
51 106 abaplía 7 2327
51 106 el 8 1104
51 106 sabor 9 409
51 106 del 10 360
51 106 pollo 11 1605
51 106 envipió 12 256
51 106 un 13 4573
51 106 libro 14 660
51 106 *. 15 519
part=participant; item=sentence; n.word=word number; rt=reading time.
In the results file, I have the reading times of every word of every sentence read by every participant. Every participant read more or less 40 sentences. My problem is that I am interested in the reading times of specific words, such as the main verb or the last word of each sentence. But as every sentence is a bit different, the main verb is not always in the same position for each sentence. So I've done another table with the position of the words I'm interested in every sentence.
rules
item v1 v2 n1 n2
106 12 7 3 5
107 11 8 3 6
108 11 8 3 6
item=sentence; v1=main verb; v2=secondary verb; n1=first noun; n2=second noun.
So this should be read: For sentence 106, the main verb is the word number 12, the secondary verb is the word number 7 and so on.
I want to have a final table that looks like this:
results2
part item v1 v2 n1 n2
51 106 256 2327 327 394
51 107 ...
52 106 ...
Does anyone know how to do this? It's kind of a long-to-wide reshaping problem, but with a more complex scenario.
If anyone could help me, I would really appreciate it! Thanks!!
You can try the following code, which joins your results data to a reshaped rules data, and then reshapes the result into a wider form.
library(tidyr)
library(dplyr)
inner_join(select(results, -word),
pivot_longer(rules, -item), c("item", "n.word"="value")) %>%
select(-n.word) %>%
pivot_wider(names_from=name, values_from=rt) %>%
select(part, item, v1, v2, n1, n2)
# A tibble: 1 x 6
# part item v1 v2 n1 n2
# <int> <int> <int> <int> <int> <int>
#1 51 106 256 2327 327 394
Data:
results <- structure(list(part = c(51L, 51L, 51L, 51L, 51L, 51L, 51L, 51L,
51L, 51L, 51L, 51L, 51L, 51L, 51L), item = c(106L, 106L, 106L,
106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L, 106L,
106L), word = c("*", "El", "asistente", "del", "carnicero", "que",
"abapl’a", "el", "sabor", "del", "pollo", "envipi—", "un", "libro",
"*."), n.word = 1:15, rt = c(382L, 286L, 327L, 344L, 394L, 274L,
2327L, 1104L, 409L, 360L, 1605L, 256L, 4573L, 660L, 519L)), class = "data.frame", row.names = c(NA,
-15L))
rules <- structure(list(item = 106:108, v1 = c(12L, 11L, 11L), v2 = c(7L,
8L, 8L), n1 = c(3L, 3L, 3L), n2 = c(5L, 6L, 6L)), class = "data.frame", row.names = c(NA,
-3L))
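If you prefer to avoid the tidyverse, the same join-and-reshape can be sketched in base R with merge and reshape. The sketch below uses a cut-down version of the data (only the four words of interest for item 106); the column names rt.v1 etc. are what reshape produces by default:

```r
results <- data.frame(part = 51L, item = 106L,
                      n.word = c(3L, 5L, 7L, 12L),
                      rt = c(327L, 394L, 2327L, 256L))
rules <- data.frame(item = 106L, v1 = 12L, v2 = 7L, n1 = 3L, n2 = 5L)

# rules to long: one row per (item, role, word position)
long_rules <- reshape(rules, direction = "long",
                      varying = c("v1", "v2", "n1", "n2"),
                      v.names = "n.word", timevar = "role",
                      times = c("v1", "v2", "n1", "n2"))
# keep only the words of interest, then spread the roles back out
m <- merge(results, long_rules, by = c("item", "n.word"))
wide_rt <- reshape(m[, c("part", "item", "role", "rt")],
                   direction = "wide",
                   idvar = c("part", "item"), timevar = "role")
wide_rt
```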
I have data with multiple monthly variables that I would like to aggregate to the quarterly level. My initial data is:
Time A B C D . . . . . K
Jan-2004 42 57 53 28
Feb-2004 40 78 56 28
Mar-2004 68 77 53 20
Apr-2004 97 96 80 16
May-2004 84 93 76 17
Jun-2004 57 100 100 21
Jul-2004 62 100 79 22
.
.
.
.
N
So the goal is to calculate each quarter as the average of its months, i.e. (Jan + Feb + Mar)/3. In other words, the goal is to end up with data like this:
Time A B C D . . . . . K
2004Q1 50.0 70.7 54.0 25.3
2004Q2 79.3 96.3 85.3 18.0
2004Q3
.
.
.
N
Could anyone help me with this problem?
Thank you very much.
An option would be to convert the 'Time' column to yearqtr class with as.yearqtr from zoo and then do a summarise_all:
library(zoo)
library(dplyr)
df1 %>%
group_by(Time = format(as.yearqtr(Time, "%b-%Y"), "%YQ%q")) %>%
summarise_all(mean)
# A tibble: 3 x 5
# Time A B C D
# <chr> <dbl> <dbl> <dbl> <dbl>
#1 2004Q1 50 70.7 54 25.3
#2 2004Q2 79.3 96.3 85.3 18
#3 2004Q3 62 100 79 22
data
df1 <- structure(list(Time = c("Jan-2004", "Feb-2004", "Mar-2004", "Apr-2004",
"May-2004", "Jun-2004", "Jul-2004"), A = c(42L, 40L, 68L, 97L,
84L, 57L, 62L), B = c(57L, 78L, 77L, 96L, 93L, 100L, 100L), C = c(53L,
56L, 53L, 80L, 76L, 100L, 79L), D = c(28L, 28L, 20L, 16L, 17L,
21L, 22L)), class = "data.frame", row.names = c(NA, -7L))
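For completeness, the same quarterly averaging can be done without zoo: parse the month out of the Time string and build the quarter label by hand (note that %b month-name parsing is locale-dependent). A base-R sketch on the first four months:

```r
df1 <- data.frame(Time = c("Jan-2004", "Feb-2004", "Mar-2004", "Apr-2004"),
                  A = c(42, 40, 68, 97))
# prepend a day so as.Date can parse the month-year string
d <- as.Date(paste0("01-", df1$Time), format = "%d-%b-%Y")
df1$qtr <- paste0(format(d, "%Y"), "Q",
                  (as.integer(format(d, "%m")) - 1) %/% 3 + 1)
res_q <- aggregate(A ~ qtr, data = df1, FUN = mean)
res_q
```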
data.table has a quarter function, so you can do:
library(data.table)
setDT(my_data)
my_data[ , lapply(.SD, mean), by = .(year = year(Time), quarter = quarter(Time))]
This is the gist of it. Getting it to work exactly would require a reproducible example.
This question already has answers here: How to sum a variable by group (18 answers). Closed 5 years ago.
Here is a minimal example of dataframe to reproduce.
df <- structure(list(Gene = structure(c(147L, 147L, 148L, 148L, 148L,
87L, 87L, 87L, 87L, 87L), .Label = c("genome", "k141_1189_101",
"k141_1189_104", "k141_1189_105", "k141_1189_116", "k141_1189_13",
"k141_1189_14", "k141_1189_146", "k141_1189_150", "k141_1189_18",
"k141_1189_190", "k141_1189_194", "k141_1189_215", "k141_1189_248",
"k141_1189_251", "k141_1189_252", "k141_1189_259", "k141_1189_274",
"k141_1189_283", "k141_1189_308", "k141_1189_314", "k141_1189_322",
"k141_1189_353", "k141_1189_356", "k141_1189_372", "k141_1189_373",
"k141_1189_43", "k141_1189_45", "k141_1189_72", "k141_1597_15",
"k141_1597_18", "k141_1597_23", "k141_1597_41", "k141_1597_55",
"k141_1597_66", "k141_1597_67", "k141_1597_68", "k141_1597_69",
"k141_2409_34", "k141_2409_8", "k141_3390_69", "k141_3390_83",
"k141_3390_84", "k141_3726_25", "k141_3726_31", "k141_3726_49",
"k141_3726_50", "k141_3726_62", "k141_3726_8", "k141_3726_80",
"k141_3790_1", "k141_3993_114", "k141_3993_122", "k141_3993_162",
"k141_3993_172", "k141_3993_183", "k141_3993_186", "k141_3993_188",
"k141_3993_24", "k141_3993_25", "k141_3993_28", "k141_3993_32",
"k141_3993_44", "k141_3993_47", "k141_3993_53", "k141_3993_57",
"k141_3993_68", "k141_4255_80", "k141_4255_81", "k141_4255_87",
"k141_5079_107", "k141_5079_110", "k141_5079_130", "k141_5079_14",
"k141_5079_141", "k141_5079_16", "k141_5079_184", "k141_5079_185",
"k141_5079_202", "k141_5079_24", "k141_5079_39", "k141_5079_63",
"k141_5079_65", "k141_5079_70", "k141_5079_77", "k141_5079_87",
"k141_5079_9", "k141_5313_16", "k141_5313_17", "k141_5313_20",
"k141_5313_23", "k141_5313_39", "k141_5313_5", "k141_5313_51",
"k141_5313_52", "k141_5313_78", "k141_5545_101", "k141_5545_103",
"k141_5545_104", "k141_5545_105", "k141_5545_106", "k141_5545_107",
"k141_5545_108", "k141_5545_109", "k141_5545_110", "k141_5545_111",
"k141_5545_112", "k141_5545_113", "k141_5545_114", "k141_5545_119",
"k141_5545_128", "k141_5545_130", "k141_5545_139", "k141_5545_141",
"k141_5545_145", "k141_5545_16", "k141_5545_169", "k141_5545_17",
"k141_5545_172", "k141_5545_6", "k141_5545_60", "k141_5545_62",
"k141_5545_63", "k141_5545_86", "k141_5545_87", "k141_5545_88",
"k141_5545_89", "k141_5545_91", "k141_5545_92", "k141_5545_93",
"k141_5545_94", "k141_5545_96", "k141_5545_97", "k141_5545_98",
"k141_5545_99", "k141_5734_13", "k141_5734_2", "k141_5734_4",
"k141_5734_5", "k141_5734_6", "k141_6014_124", "k141_6014_2",
"k141_6014_34", "k141_6014_75", "k141_6014_96", "k141_908_14",
"k141_908_2", "k141_908_5", "k141_957_126", "k141_957_135", "k141_957_136",
"k141_957_14", "k141_957_140", "k141_957_141", "k141_957_148",
"k141_957_179", "k141_957_191", "k141_957_35", "k141_957_47",
"k141_957_55", "k141_957_57", "k141_957_59", "k141_957_6", "k141_957_63",
"k141_957_65", "k141_957_68", "k141_957_77", "k141_957_95"), class = "factor"),
depth = c(9L, 10L, 9L, 10L, 11L, 14L, 15L, 16L, 17L, 18L),
bases_covered = c(6L, 3L, 4L, 7L, 4L, 59L, 54L, 70L, 34L,
17L), gene_length = c(1140L, 1140L, 591L, 591L, 591L, 690L,
690L, 690L, 690L, 690L), regioncoverage = c(54L, 30L, 36L,
70L, 44L, 826L, 810L, 1120L, 578L, 306L)), .Names = c("Gene",
"depth", "bases_covered", "gene_length", "regioncoverage"), row.names = c(1L,
2L, 33L, 34L, 35L, 78L, 79L, 80L, 81L, 82L), class = "data.frame")
The dataframe looks like this:
Gene depth bases_covered gene_length regioncoverage
1 k141_908_2 9 6 1140 54
2 k141_908_2 10 3 1140 30
33 k141_908_5 9 4 591 36
34 k141_908_5 10 7 591 70
35 k141_908_5 11 4 591 44
78 k141_5079_9 14 59 690 826
79 k141_5079_9 15 54 690 810
80 k141_5079_9 16 70 690 1120
81 k141_5079_9 17 34 690 578
82 k141_5079_9 18 17 690 306
What I want is, for each Gene (e.g. k141_908_2), to sum regioncoverage and divide by unique(gene_length). In fact, gene_length is always the same value within a gene.
For example, for Gene k141_908_2 I would do: (54 + 30)/1140 = 0.07
For example, for Gene k141_908_5 I would do: (36 + 70 + 44)/591 = 0.25
The final dataframe should report two columns.
Gene Newcoverage
1 k141_908_2 0.07
2 k141_908_5 0.25
3 ......
and so on.
Thanks for your help
This is straightforward with dplyr:
library(dplyr)
df_final <- df %>%
group_by(Gene) %>%
summarize(Newcoverage = sum(regioncoverage) / first(gene_length))
df_final
# # A tibble: 3 × 2
# Gene Newcoverage
# <fctr> <dbl>
# 1 k141_5079_9 5.27536232
# 2 k141_908_2 0.07368421
# 3 k141_908_5 0.25380711
I needed to set the first column to character and the others to numeric. But after that you can just split the df by gene and do the necessary calculations.
df[,2:5] = lapply(df[,2:5], as.numeric)
df$Gene = as.character(df$Gene)
sapply(split(df, df$Gene), function(x) sum(x[,5]/x[1,4]))
#k141_5079_9 k141_908_2 k141_908_5
# 5.27536232 0.07368421 0.25380711
We can use the tidyverse:
library(tidyverse)
df %>%
group_by(Gene) %>%
summarise(Newcoverage = sum(regioncoverage)/gene_length[1])
# A tibble: 3 × 2
# Gene Newcoverage
# <fctr> <dbl>
#1 k141_5079_9 5.27536232
#2 k141_908_2 0.07368421
#3 k141_908_5 0.25380711
Or a base R option is
by(df[4:5], list(as.character(df[,'Gene'])), FUN= function(x) sum(x[,2])/x[1,1])
A quick approach with data.table is:
require(data.table)
DT <- setDT(df)
#just to output unique rows
DT[, .(New_Coverage = unique(sum(regioncoverage)/gene_length)), by = .(Gene)]
output
Gene New_Coverage
1: k141_908_2 0.07368421
2: k141_908_5 0.25380711
3: k141_5079_9 5.27536232
I use dplyr a lot. So here's one way:
library(dplyr)
df %>%
group_by(Gene) %>%
mutate(Newcoverage=sum(regioncoverage)/unique(gene_length))
If you want only unique values per Gene:
df %>%
group_by(Gene) %>%
transmute(Newcoverage=sum(regioncoverage)/unique(gene_length)) %>%
unique()
Suppose I have a dataframe like this:
X. Name Type Total HP Attack Defense Sp..Atk Sp..Def Speed
795 718 Zygarde50% Forme Dragon/Ground 600 108 100 121 81 95 95
796 719 Diancie Rock/Fairy 600 50 100 150 100 150 50
797 719 DiancieMega Diancie Rock/Fairy 700 50 160 110 160 110 110
798 720 HoopaHoopa Confined Psychic/Ghost 600 80 110 60 150 130 70
799 720 HoopaHoopa Unbound Psychic/Dark 680 80 160 60 170 130 80
800 721 Volcanion Fire/Water 600 80 110 120 130 90 70
If I want to calculate the average stats (Total, HP, Attack, Defense, etc...) per individual type (Dragon, Ground, Rock, Fairy, etc., instead of the combined Dragon/Ground, Rock/Fairy), how would I proceed? The stats of a Pokémon that belongs to two types should count toward the averages of both.
I have written the code using functions in the dplyr package:
summaryStats_byType<- summarise(byType,
count = n(),
averageTotal = mean(Total, na.rm = T),
averageHP = mean(HP, na.rm = T),
averageDefense = mean(Defense, na.rm = T),
averageSpAtk = mean(Sp..Atk, na.rm = T),
averageSpDef = mean(Sp..Def, na.rm = T),
averageSpeed = mean(Speed, na.rm = T))
but obviously it counts "Dragon/Ground" as a type instead of two.
One way is to split the Type column into long format (I chose cSplit from splitstackshape to do this) and then group_by as usual, i.e.
library(splitstackshape)
library(dplyr)
df1 <- cSplit(df, 'Type', sep = '/', 'long')
df1 %>%
group_by(Type) %>%
summarise_each(funs(mean), -c(X., Name))
# A tibble: 9 × 8
# Type Total HP Attack Defense Sp..Atk Sp..Def Speed
# <fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 Dark 680 80 160 60 170 130 80
#2 Dragon 600 108 100 121 81 95 95
#3 Fairy 650 50 130 130 130 130 80
#4 Fire 600 80 110 120 130 90 70
#5 Ghost 600 80 110 60 150 130 70
#6 Ground 600 108 100 121 81 95 95
#7 Psychic 640 80 135 60 160 130 75
#8 Rock 650 50 130 130 130 130 80
#9 Water 600 80 110 120 130 90 70
Alternatively (as noted by @DavidArenburg) we can also use separate_rows from tidyr as part of the pipe, i.e.
library(tidyr)
library(dplyr)
df %>%
separate_rows(Type) %>%
group_by(Type) %>%
summarise_each(funs(mean), -c(X., Name))
which of course yields the same results
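The row expansion itself can also be done in base R with strsplit plus rep, if you want to see what cSplit / separate_rows are doing under the hood (two-row toy frame with one stat column, for brevity):

```r
poke <- data.frame(Type = c("Dragon/Ground", "Rock/Fairy"),
                   Total = c(600, 650))
# split each Type on "/" and repeat the stats once per resulting type
types <- strsplit(as.character(poke$Type), "/", fixed = TRUE)
long <- data.frame(Type = unlist(types),
                   Total = rep(poke$Total, lengths(types)))
res_type <- aggregate(Total ~ Type, data = long, FUN = mean)
res_type
```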
DATA
dput(df)
structure(list(X. = c(718L, 719L, 719L, 720L, 720L, 721L), Name = structure(c(6L,
1L, 2L, 3L, 4L, 5L), .Label = c("Diancie", "DiancieMega_Diancie",
"HoopaHoopa_Confined", "HoopaHoopa_Unbound", "Volcanion", "Zygarde50%_Forme"
), class = "factor"), Type = structure(c(1L, 5L, 5L, 4L, 3L,
2L), .Label = c("Dragon/Ground", "Fire/Water", "Psychic/Dark",
"Psychic/Ghost", "Rock/Fairy"), class = "factor"), Total = c(600L,
600L, 700L, 600L, 680L, 600L), HP = c(108L, 50L, 50L, 80L, 80L,
80L), Attack = c(100L, 100L, 160L, 110L, 160L, 110L), Defense = c(121L,
150L, 110L, 60L, 60L, 120L), Sp..Atk = c(81L, 100L, 160L, 150L,
170L, 130L), Sp..Def = c(95L, 150L, 110L, 130L, 130L, 90L), Speed = c(95L,
50L, 110L, 70L, 80L, 70L)), .Names = c("X.", "Name", "Type",
"Total", "HP", "Attack", "Defense", "Sp..Atk", "Sp..Def", "Speed"
), class = "data.frame", row.names = c("795", "796", "797", "798",
"799", "800"))