Counting the rows in a data frame based on integer ranges - r

I have a data frame that lists a bunch of objects and their values.
  Name            NumCpu MemoryMB
1 BEAVERTN-SVR-C5      1     3072
2 BEAVERTN-SVR-UK      4     4096
3 BEAVERTN-SVR-JV      1     1024
I want to take my data frame and create a new column that groups these numbers by ranges.
Ranges: 0-1024, 1025-2048, 2049-4096
And then output the counts of those ranges into a new data frame:
Range Count
0-1024 1
1025-2048 0
2049-4096 2
I learn by doing, so this is a real work problem I'm trying to use R to solve. Any help greatly appreciated. Thank you!
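One way this could be approached, as a minimal sketch rather than a definitive answer, is to bin MemoryMB with cut() and tabulate the bins with table(). The break points and labels below are taken from the ranges listed in the question; the names breaks, labels and counts are only illustrative, and the data frame is rebuilt here without a Range column so the block runs on its own.

```r
# Rebuild the example data from the question (without a Range column)
DF <- data.frame(
  Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK", "BEAVERTN-SVR-JV"),
  NumCpu = c(1L, 4L, 1L),
  MemoryMB = c(3072L, 4096L, 1024L)
)

# Bin edges and labels for the ranges 0-1024, 1025-2048, 2049-4096
breaks <- c(0, 1024, 2048, 4096)
labels <- c("0-1024", "1025-2048", "2049-4096")

# cut() uses right-closed intervals by default, so 1024 falls in the first bin;
# include.lowest = TRUE also places a value of exactly 0 in the first bin
DF$Range <- cut(DF$MemoryMB, breaks = breaks, labels = labels,
                include.lowest = TRUE)

# table() on a factor keeps empty levels, so the zero-count range is reported
counts <- as.data.frame(table(Range = DF$Range))
names(counts) <- c("Range", "Count")
counts
```

Because Range is a factor, the empty 1025-2048 bin still appears in counts with a count of 0, which matches the layout asked for in the question.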
Data
DF <- structure(list(Name = c("BEAVERTN-SVR-C5", "BEAVERTN-SVR-UK",
"BEAVERTN-SVR-JV"), NumCpu = c(1L, 4L, 1L), MemoryMB = c(3072L,
4096L, 1024L), Range = structure(c(3L, 3L, 1L), .Label = c("(0,1.02e+03]",
"(1.02e+03,2.05e+03]", "(2.05e+03,4.1e+03]"), class = "factor")), .Names = c("Name",
"NumCpu", "MemoryMB", "Range"), row.names = c("1", "2", "3"), class = "data.frame")

Related

Removing duplicates of same date and location (3 columns) in R

I know there are like a million questions regarding duplicate removal, but unfortunately
none of them helped me so far. I struggle with the following:
I have a data frame (loc) that contains citizen science observations of nature (animals, plants, etc.). It has about 90,000 rows and looks like this:
ID Datum lat long Anzahl Art Gruppe Anrede Wochentag
1 1665376475 2019-05-09 51.30993 9.319896 20 Alytes obstetricans Amphibien Herr Do
2 529728479 2019-05-06 50.58524 8.503332 1 Alytes obstetricans Amphibien Frau Mo
3 1579862637 2019-05-23 50.53925 8.467546 8 Alytes obstetricans Amphibien Herr Do
4 -415013306 2019-05-06 50.58524 8.503332 3 Alytes obstetricans Amphibien Frau Mo
I also made a small sample data frame (loc_sample) of 10 observations and used dput(loc_sample):
structure(list(ID = c(688380991L, -1207894879L, 802295973L, -815104336L, -632066829L, -133354744L, 1929856503L, 952982037L, 1782222413L, 1967897802L),
Datum = structure(c(1559088000, 1558742400, 1557619200, 1557273600, 1557187200, 1557619200, 1557619200, 1557187200, 1557964800, 1556841600),
tzone = "UTC",
class = c("POSIXct", "POSIXt")),
lat = c(52.1236088700115, 51.5928822313012, 53.723426877949, 50.7737623304861, 49.9238597947287, 51.805563222817, 50.1738326622472, 51.2763067511127, 51.395189306337, 51.5732959108075),
long = c(8.62399927116144, 9.89597797393799, 9.04058595819038, 8.20740532922287, 8.29073164862348, 9.9225640296936, 8.79065646492143, 6.40700340270996, 6.47360801696777, 6.25690012620748),
Anzahl = c(2L, 25L, 4L, 1L, 1L, 30L, 2L, 1L, 1L, 1L),
Art = c("Sturnus vulgaris", "Olethreutes arcuella", "Sylvia atricapilla", "Buteo buteo", "Turdus merula", "Orchis mascula subsp. mascula", "Parus major", "Luscinia megarhynchos", "Milvus migrans", "Andrena bicolor"),
Gruppe = c("Voegel", "Schmetterlinge", "Voegel", "Voegel", "Voegel", "Pflanzen", "Voegel", "Voegel", "Voegel", "InsektenSonstige"),
Anrede = c("Herr", "Herr", "Frau", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr", "Herr"),
Wochentag = structure(c(4L, 7L, 1L, 4L, 3L, 1L, 1L, 3L, 5L, 6L),
.Label = c("So", "Mo", "Di", "Mi", "Do", "Fr", "Sa"),
class = c("ordered", "factor"))),
row.names = c(NA, -10L),
class = "data.frame")
For my question only the variables Datum, lat and long are important. Datum is a date in POSIXct format, while lat and long are both numeric. There are quite a few observations that were reported on the same day from the exact same location. I would like to filter and remove those. So I have to check three separate columns and keep only one observation for each "same-place-same-day" group.
I already tried putting the three variables in question into one:
loc$dupl <- paste(loc$Datum, loc$lat, loc$long, sep=" ,")
locu <- unique(loc[,2:4])
It seems like I managed to filter the duplicates, but I'm not sure whether this is the correct way to do it.
Also, that gives me a data frame with only Datum, lat and long. As a final result I need the original data frame without the duplicates in date and location, but with all the other columns kept for the remaining rows.
When I try:
locu <- unique(loc[,2:9])
It gives me all the other columns, but it doesn't remove the date and location duplicates.
Thanks in advance for your help!
This can work:
#Code
new <- loc[!duplicated(paste(loc$Datum,loc$lat,loc$long)),]
To get the full data frame back after finding the duplicates, you could do something like:
loc[!duplicated(loc[,2:4]),]
This code first detects the duplicate rows and then subsets your original data frame.
Note: this code will always keep the first occurrences and delete the duplicates in subsequent rows. If you want to keep a certain ID (e.g. the second one, not the first one), we need a different solution.
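To illustrate the note about which row survives, here is a small sketch on made-up data (the values below are invented for illustration and only reuse the column names from the question). It shows the default behaviour, the fromLast = TRUE variant, and one possible way to keep a specific row per group, here the one with the largest Anzahl, by sorting before calling duplicated().

```r
# Toy data: IDs 1, 2, 4 share one date/location, IDs 3 and 5 share another
loc <- data.frame(
  ID = 1:5,
  Datum = as.POSIXct(c("2019-05-06", "2019-05-06", "2019-05-09",
                       "2019-05-06", "2019-05-09"), tz = "UTC"),
  lat = c(50.58524, 50.58524, 51.30993, 50.58524, 51.30993),
  long = c(8.503332, 8.503332, 9.319896, 8.503332, 9.319896),
  Anzahl = c(1L, 3L, 20L, 2L, 5L)
)

key_cols <- c("Datum", "lat", "long")

# Default: the first row of each date/location group is kept
first_kept <- loc[!duplicated(loc[, key_cols]), ]

# fromLast = TRUE scans from the bottom, so the last row of each group is kept
last_kept <- loc[!duplicated(loc[, key_cols], fromLast = TRUE), ]

# Keep the row with the largest Anzahl per group: sort first, then deduplicate
ord <- loc[order(-loc$Anzahl), ]
max_kept <- ord[!duplicated(ord[, key_cols]), ]

first_kept$ID  # 1 3
last_kept$ID   # 4 5
max_kept$ID    # 3 2
```

The same sort-then-deduplicate idea should work for any other rule that can be expressed as an ordering of the rows.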

Is there a way in R to convert the following character variable?

I have the following data frame with a character variable that represents the blocked lanes on a highway. Can I replace this vector with a similar vector that holds numbers instead of letters?
df<- structure(list(Blocked.Lanes = c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|",
"|RS|", "|RS|RE|", "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
"|RS|ML|ML|"), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), row.names = c(NA,
10L), class = "data.frame")
The output should be something like df2 below:
df2<- structure(list(Blocked.Lanes = c(3L, 1L, 5L, 1L, 2L, 3L, 2L,
1L, 6L, 3L), Event.Id = c(240314L, 240381L, 240396L, 240796L,
240948L, 241089L, 241190L, 241225L, 241226L, 241241L)), class = "data.frame", row.names = c(NA,
-10L))
One way would be to count the number of "|" characters in each string. We subtract 1 because there is one more "|" than there are lanes.
stringr::str_count(df$Blocked.Lanes, '\\|') - 1
#[1] 3 1 5 1 2 3 2 1 6 3
In base R :
lengths(gregexpr("\\|", df$Blocked.Lanes)) - 1
Another way would be to count the words in the string.
stringr::str_count(df$Blocked.Lanes, '\\w+')
lengths(gregexpr("\\w+", df$Blocked.Lanes))
Similar to Ronak's solution you could also do:
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]{2}\\b")
if the lanes are always 2 letters long, or
stringr::str_count(df$Blocked.Lanes, "\\b[A-Z]+\\b")
if the lanes are always at least one letter long.
stringr::str_count(df$Blocked.Lanes, "(?<=\\|)[A-Z]+(?=\\|)")
also works.
Not as succinct as @Ronak Shah's, but another method in base R.
String split on string literal "|" and then count elements:
df2 <- transform(df, Blocked.Lanes = lengths(Map(function(x) x[x != ""],
strsplit(df$Blocked.Lanes, "|", fixed = TRUE))))
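One caveat with the gregexpr()-based counts above: gregexpr() reports "no match" as a single -1, so lengths() returns 1 for a string with no match at all. A possible way around this, sketched below with base R only, is to extract the matches with regmatches() first and count those. The helper name count_lanes is made up for this example, and the pattern assumes lane codes consist of uppercase letters, as in the sample data.

```r
# Hypothetical helper: count runs of uppercase letters in each string
count_lanes <- function(x) {
  lengths(regmatches(x, gregexpr("[A-Z]+", x)))
}

lanes <- c("|RS|RS|ML|", "|RS|", "|RS|ML|ML|ML|ML|", "|RS|", "|RS|RE|",
           "|ML|ML|ML|", "|RS|ML|", "|RS|", "|ML|ML|ML|ML|ML|ML|",
           "|RS|ML|ML|")

count_lanes(lanes)  # 3 1 5 1 2 3 2 1 6 3
count_lanes("")     # 0

# For comparison: counting the raw gregexpr() result gives 1 for an empty string
lengths(gregexpr("\\w+", ""))  # 1
```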

how to assign list of values of attribute to single attribute in R

I am at a beginner stage of R programming; please help me with the issue below.
I have different desc values assigned to the same sol attribute in different rows. I want to combine all desc values for each sol value into a single row, as shown below.
My data is as follows:
sol desc
1 fry, toast
1 frt,grt,gty
1 ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi
4 opu,yut,yut
I want the output as follows :
sol desc
1 fry,toast,frt,grt,gty,ytr,uyt,ytr
6 hyt, ytr,oiu
4 hyg,hyu,loi,opu,yut,yut
Note: you can input any values in desc as per your convenience.
aggregate() is what you are looking for. Try this:
aggregate(desc ~ sol, data = df, paste, collapse = ",")
sol desc
1 1 fry, toast,frt,grt,gty,ytr,uyt,ytr
2 4 hyg,hyu,loi,opu,yut,yut
3 6 hyt, ytr,oiu
Data
df <- structure(list(sol = c(1L, 1L, 1L, 6L, 4L, 4L), desc = c("fry, toast",
"frt,grt,gty", "ytr,uyt,ytr", "hyt, ytr,oiu", "hyg,hyu,loi",
"opu,yut,yut")), .Names = c("sol", "desc"), class = "data.frame", row.names = c(NA,
-6L))
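The aggregate() output above keeps the stray space in "fry, toast" and sorts the groups by sol, whereas the desired output has no space after the comma and lists sol in order of first appearance (1, 6, 4). If that matters, one possible sketch is to normalise the separators first and reorder afterwards. The names res and res_ordered are only illustrative, and this version also removes the space in "hyt, ytr,oiu", which the desired output happens to keep.

```r
df <- data.frame(
  sol = c(1L, 1L, 1L, 6L, 4L, 4L),
  desc = c("fry, toast", "frt,grt,gty", "ytr,uyt,ytr",
           "hyt, ytr,oiu", "hyg,hyu,loi", "opu,yut,yut"),
  stringsAsFactors = FALSE
)

# Remove any whitespace around the commas before combining
df$desc <- gsub("\\s*,\\s*", ",", df$desc)

# Collapse all desc values per sol; aggregate() returns groups sorted by sol
res <- aggregate(desc ~ sol, data = df, FUN = paste, collapse = ",")

# Restore the order in which each sol value first appears in the input
res_ordered <- res[match(unique(df$sol), res$sol), ]
res_ordered
```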

How do I plot boxplots of two different series?

I have 2 data frames sharing the same row IDs but with different columns.
Here is an example
chrom coord sID CM0016 CM0017 CM0018
7 10 3178881 SP_SA036,SP_SA040 0.000000000 0.000000000 0.0009923
8 10 38894616 SP_SA036,SP_SA040 0.000434783 0.000467464 0.0000970
9 11 104972190 SP_SA036,SP_SA040 0.497802888 0.529319536 0.5479003
and
chrom coord sID CM0001 CM0002 CM0003
4 10 3178881 SP_SA036,SA040 0.526806527 0.544927536 0.565610860
5 10 38894616 SP_SA036,SA040 0.009049774 0.002849003 0.002857143
6 11 104972190 SP_SA036,SA040 0.451612903 0.401617251 0.435318275
I am trying to create a composite boxplot figure where the x axis shows chrom and coord combined (so 3 points), and each x value has 2 boxplots side by side corresponding to the two data frames.
What is the best way of doing this? Should I merge the two data frames into one and loop over the boxplot rendering 3 columns at a time?
Any idea on how this can be done?
The problem is that the two dataframes have the same number of rows but can differ in number of columns
> dim(A)
[1] 99 20
> dim(B)
[1] 99 28
I was thinking about transposing the data frames in order to get the same number of columns, but got lost on how to do this properly.
Thanks in advance
UPDATE
This is what I tried to do
I merged chrom and coord columns together to create a single ID
I used reshape to melt the data frames
I merged the 2 melted dataframe into a single one
the head looks like this
I have two variable A2 and A4 corresponding to the 2 dataframes
then I created a boxplot using this
ggplot(A2A4, aes(factor(combine), value)) +geom_boxplot(aes(fill = factor(variable)))
I think it solved my problem but the boxplot looks very busy with 99 x values with 2 boxplots each
So if these are your input tables
d1<-structure(list(chrom = c(10L, 10L, 11L),
coord = c(3178881L, 38894616L, 104972190L),
sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SP_SA040", class = "factor"),
CM0016 = c(0, 0.000434783, 0.497802888), CM0017 = c(0, 0.000467464,
0.529319536), CM0018 = c(0.0009923, 9.7e-05, 0.5479003)), .Names = c("chrom",
"coord", "sID", "CM0016", "CM0017", "CM0018"), class = "data.frame", row.names = c("7",
"8", "9"))
d2<-structure(list(chrom = c(10L, 10L, 11L), coord = c(3178881L,
38894616L, 104972190L), sID = structure(c(1L, 1L, 1L), .Label = "SP_SA036,SA040", class = "factor"),
CM0001 = c(0.526806527, 0.009049774, 0.451612903), CM0002 = c(0.544927536,
0.002849003, 0.401617251), CM0003 = c(0.56561086, 0.002857143,
0.435318275)), .Names = c("chrom", "coord", "sID", "CM0001",
"CM0002", "CM0003"), class = "data.frame", row.names = c("4",
"5", "6"))
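Then I would combine and reshape the data to make it easier to plot. Here's what I'd do
library(reshape2)  # provides melt()
library(ggplot2)   # used for the plot further down
m1<-melt(d1, id.vars=c("chrom", "coord", "sID"))
m2<-melt(d2, id.vars=c("chrom", "coord", "sID"))
# tag each melted table with its source, then stack them
mm<-rbind(cbind(m1, s="T1"), cbind(m2, s="T2"))
# build the x-axis factor, with levels ordered by chrom and then coord
mm$pos<-factor(paste(mm$chrom,mm$coord,sep=":"),
levels=do.call(paste, c(unique(mm[order(mm[[1]],mm[[2]]),1:2]), sep=":")))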
I first melt the two input tables to turn columns into rows. Then I add a column to each table so I know where the data came from and rbind them together. And finally I do a bit of messy work to make a factor out of the chr/coord pairs sorted in the correct order.
With all that done, I'll make the plot like
ggplot(mm, aes(x=pos, y=value, color=s)) +
geom_boxplot(position="dodge")
and it looks like

Combining two data frames if values in one column fall between values in another

I imagine that there's some way to do this with sqldf, though I'm not familiar with the syntax of that package enough to get this to work. Here's the issue:
I have two data frames, each of which describe genomic regions and contain some other data. I have to combine the two if the region described in the one df falls within the region of the other df.
One df, g, looks like this (though my real data has other columns)
start_position end_position
1 22926178 22928035
2 22887317 22889471
3 22876403 22884442
4 22862447 22866319
5 22822490 22827551
And another, l, looks like this (this sample has a named column)
name start end
101 GRMZM2G001024 11149187 11511198
589 GRMZM2G575546 24382534 24860958
7859 GRMZM2G441511 22762447 23762447
658 AC184765.4_FG005 26282236 26682919
14 GRMZM2G396835 10009264 10402790
I need to merge the two dataframes if the values from the start_position OR end_position columns in g fall within the start-end range in l, returning only the columns in l that have a match. I've been trying to get findInterval() to do the job, but haven't been able to return a merged DF. Any ideas?
My data:
g <- structure(list(start_position = c(22926178L, 22887317L, 22876403L,
22862447L, 22822490L), end_position = c(22928035L, 22889471L,
22884442L, 22866319L, 22827551L)), .Names = c("start_position",
"end_position"), row.names = c(NA, 5L), class = "data.frame")
l <- structure(list(name = structure(c(2L, 12L, 9L, 1L, 8L), .Label = c("AC184765.4_FG005",
"GRMZM2G001024", "GRMZM2G058655", "GRMZM2G072028", "GRMZM2G157132",
"GRMZM2G160834", "GRMZM2G166507", "GRMZM2G396835", "GRMZM2G441511",
"GRMZM2G442645", "GRMZM2G572807", "GRMZM2G575546", "GRMZM2G702094"
), class = "factor"), start = c(11149187L, 24382534L, 22762447L,
26282236L, 10009264L), end = c(11511198L, 24860958L, 23762447L,
26682919L, 10402790L)), .Names = c("name", "start", "end"), row.names = c(101L,
589L, 7859L, 658L, 14L), class = "data.frame")
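As a rough base R sketch of one way to approach this (it uses neither sqldf nor findInterval(), and assumes both tables are small enough that pairing every row of g with every row of l is affordable): merge(g, l, by = NULL) builds the full cross join, and subset() then keeps the pairings where either endpoint from g lies inside the start-end range from l. The names cross, hits and matched_l are made up for this example, and name is stored as plain character here rather than as a factor.

```r
g <- data.frame(
  start_position = c(22926178L, 22887317L, 22876403L, 22862447L, 22822490L),
  end_position   = c(22928035L, 22889471L, 22884442L, 22866319L, 22827551L)
)
l <- data.frame(
  name  = c("GRMZM2G001024", "GRMZM2G575546", "GRMZM2G441511",
            "AC184765.4_FG005", "GRMZM2G396835"),
  start = c(11149187L, 24382534L, 22762447L, 26282236L, 10009264L),
  end   = c(11511198L, 24860958L, 23762447L, 26682919L, 10402790L),
  stringsAsFactors = FALSE
)

# by = NULL gives the Cartesian product: every row of g paired with every row of l
cross <- merge(g, l, by = NULL)

# Keep pairings where start_position OR end_position falls inside [start, end]
hits <- subset(cross,
               (start_position >= start & start_position <= end) |
               (end_position   >= start & end_position   <= end))

# Rows of l that matched at least one region in g
matched_l <- l[l$name %in% hits$name, ]

hits
matched_l
```

With the sample data, all five regions in g fall inside the range of GRMZM2G441511, so hits has five rows and matched_l has one. For genome-scale tables the cross join grows as nrow(g) * nrow(l), so a dedicated overlap join would likely be a better fit.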
