efficient string value count in large data.frame


I have a large dataframe (~600K rows) with a string-valued column (link):
doc_id,link
1,http://example.com
1,http://example.com
2,http://test1.net
2,http://test2.net
2,http://test5.net
3,http://test1.net
3,http://example.com
4,http://test5.net
and I would like to count the number of times each string value occurs in the frame. The result should look like this:
link, count
http://example.com, 3
http://test1.net, 2
http://test2.net, 1
http://test5.net, 2
Is there an efficient way to do this in R? Converting the frame into a matrix doesn't work because of the frame size. Currently I am using the plyr package, but this is too slow.

The table function counts occurrences - and it's very fast compared to ddply. So, something like this perhaps:
# some sample data
set.seed(42)
df <- data.frame(doc_id=1:10, link=sample(letters[1:3], 10, replace=TRUE))
cnt <- as.data.frame(table(df$link))
# Assign appropriate names (optional)
names(cnt) <- c("link", "count")
cnt
Which gives the following output:
link count
1 a 2
2 b 3
3 c 5
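As a hedged sketch of the same idea applied to the sample rows from the question: the data frame below is retyped by hand from the question, and naming the table dimension and passing responseName is just one way to get the link and count column names without a separate names() call.

```r
# Sample rows retyped from the question (the real data has ~600K rows)
df <- data.frame(
  doc_id = c(1, 1, 2, 2, 2, 3, 3, 4),
  link = c("http://example.com", "http://example.com", "http://test1.net",
           "http://test2.net", "http://test5.net", "http://test1.net",
           "http://example.com", "http://test5.net"),
  stringsAsFactors = FALSE
)

# table() counts each distinct value; naming the dimension "link" and
# setting responseName gives the column names asked for in the question
cnt <- as.data.frame(table(link = df$link),
                     responseName = "count",
                     stringsAsFactors = FALSE)

# Optional: most frequent links first
cnt <- cnt[order(-cnt$count), ]
cnt
```

Sorting is optional; on a large frame the table() call itself does the counting work.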

Related

R function that creates indicator variable values unique between several columns

I'm using the Drug Abuse Warning Network data to analyze common drug combinations in ER visits. Each additional drug is coded by a number in variables DRUGID_1....16. So Pt1 might have DRUGID_1 = 44 (cocaine) and DRUGID_3 = 20 (heroin), while Pt2 might have DRUGID_1=20 (heroin), DRUGID_3=44 (cocaine).
I want my function to loop through DRUGID_1...16 and for each of the 2 million patients create a new binary variable column for each unique drug mention, and set the value to 1 for that pt. So a value of 1 for binary variable Heroin indicates that somewhere in the pts DRUGID_1....16 heroin is mentioned.
respDRUGID <- character(0)
DRUGID.df <- data.frame(allDAWN$DRUGID_1, allDAWN$DRUGID_2, allDAWN$DRUGID_3)
Count <- 0
DrugPicker <- function(DRUGID.df){
for(i in seq_along(DRUGID.df$allDAWN.DRUGID_1)){
if (!'NA' %in% DRUGID.df[,allDAWN.DRUGID_1]){
if (!is.element(DRUGID.df$allDAWN.DRUGID_1,respDRUGID)){
Count <- Count + 1
respDRUGID[Count] <- as.character(DRUGID.df$allDAWN.DRUGID_1[Count])
assign(paste('r', as.character(respDRUGID[Count,]), sep='.'), 1)}
else {
assign(paste("r", as.character(respDRUGID[Count,]), sep='.'), 1)}
}
}
}
DrugPicker(DRUGID.df)
Here I have tried to first make a list to contain each new DRUGIDx value (respDRUGID) as well as a counter (Count) for the total number unique DRUGID values and a new dataframe (DRUGID.df) with just the relevant columns.
The function is supposed to move down the observations and, if the value is not NA and DRUGID_1 is not yet in the list respDRUGID, create a new column variable 'r.DRUGID', set its value to 1, and increase the unique count by 1. Otherwise, if the value of DRUGID_1 is already in respDRUGID, it should just set r.DRUGID = 1.
I think I've seen suggestions for get() and apply() functions, but I'm not following how to use them. The resulting dataframe has to be in the same obs x variable format so merging will align with the survey design person weight variable.

Taking a guess at your data and required result format, and using the tidyverse package:
drug_df <- read.csv(text='
patient,DRUGID_1,DRUGID_2,DRUGID_3
A,1,2,3
B,2,,
C,2,1,
D,3,1,2
')
library(tidyverse)
gather(drug_df, key = "source", value = "DRUGID", -patient, na.rm = TRUE) %>%
arrange(patient, DRUGID) %>%
group_by(patient) %>%
summarize(DRUGIDs = paste(DRUGID, collapse=","))
# patient DRUGIDs
# <fctr> <chr>
# 1 A 1,2,3
# 2 B 2
# 3 C 1,2
# 4 D 1,2,3
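The question asks for one binary column per drug rather than a pasted string, so here is a hedged base R sketch of that step. It reuses the guessed four-patient data from this answer (typed as a data frame instead of read from CSV); the r. prefix follows the naming in the question, and everything else is an assumption about the desired layout.

```r
# Guessed example data, same values as the CSV text above
drug_df <- data.frame(
  patient  = c("A", "B", "C", "D"),
  DRUGID_1 = c(1, 2, 2, 3),
  DRUGID_2 = c(2, NA, 1, 1),
  DRUGID_3 = c(3, NA, NA, 2)
)

# Stack all DRUGID_* columns into one long (patient, DRUGID) table
drug_cols <- grep("^DRUGID_", names(drug_df), value = TRUE)
long <- data.frame(
  patient = rep(drug_df$patient, times = length(drug_cols)),
  DRUGID  = unlist(drug_df[drug_cols], use.names = FALSE)
)
long <- long[!is.na(long$DRUGID), ]

# Cross-tabulate, then reduce counts to 0/1 indicators
m <- unclass(table(long$patient, long$DRUGID)) > 0
storage.mode(m) <- "integer"
indicators <- as.data.frame(m)
names(indicators) <- paste0("r.", names(indicators))
indicators
```

The result keeps one row per patient, so it can be bound back to the original data by patient ID. For 2 million rows and many distinct drugs the dense matrix may be large, so this is a sketch rather than a tuned solution.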

I found another post that does exactly what I want using stringr, destring, sapply and grepl. This works well after combining each variable into a string.
Creating dummy variables in R based on multiple chr values within each cell
Many thanks to epi99, whose post helped me think about the problem in another way.

Data handling: 2 independent factors, which decide the position of a numeric value in a new data frame

I am new to Stack Overflow and to R, so I hope you can be a bit patient and excuse any formatting mistakes.
I am trying to write an R-script, which allows me to automatically analyze the raw data of a qPCR machine.
I was quite successful in cleaning up the data, but at some point I run into trouble. My goal is to consolidate the data into a comprehensive table.
The initial data frame (DF) looks something like this:
Sample Detector Value
1 A 1
1 B 2
2 A 3
3 A 2
3 B 3
3 C 1
My goal is to have a dataframe with the Sample-names as row names and Detector as column names.
A B C
1 1 2 NA
2 3 NA NA
3 2 3 1
My approach
First I took out the names of samples and detectors and saved them in vectors as factors.
detectors = summary(DF$Detector)
detectors = names(detectors)
samples = summary(DF$Sample)
samples = names(samples)
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
Then I subsetted the detectors into a new dataframe based on the name of the detector in the dataframe.
for (i in 1:length(detectors)){
assign(detectors[i], DF[which(DF$Detector == detectors[i]),])
}
Then I initialize an empty dataframe with the right column and row names:
result = data.frame(matrix(NA, nrow = length(samples), ncol = length(detectors)))
colnames(result) = detectors
rownames(result) = samples
So now the problem. I have to get the values from the detector subsets into the result dataframe. Here it is important that each value finds its way to the right position in the dataframe. The issue is that there are not equally many values, since some samples lack some detectors.
I tried the following: iterate through the detector subsets, compare the sample name of each entry with the row name in result, and if they are the same, write the value into the new dataframe. If they are not the same, it should write an NA.
for (i in 1:length(detectors)){
for (j in 1:length(get(detectors[i])$Sample)){
result[j,i] = ifelse(get(detectors[i])$Sample[j] == rownames(result[j,]), get(detectors[i])$Ct.Mean[j], NA)
}
}
The trouble is that this stops iterating through the detector's Sample column and switches to the next detector. My understanding is that the samples being compared get out of sync, so every following ifelse yields NA.
I tried to circumvent this by replacing the "no" argument of ifelse(test, yes, no) with j=j+1 to get it back in sync, but this unfortunately didn't work.
I hope I have made my problem understandable!
Looking forward to hearing any suggestions or comments (also on how to generally improve my code ;)

We can use acast from library(reshape2) to convert from 'long' to 'wide' format.
library(reshape2)
acast(DF, Sample~Detector, value.var='Value') #returns a matrix output
# A B C
#1 1 2 NA
#2 3 NA NA
#3 2 3 1
If we need a data.frame output, use dcast.
Or use spread from library(tidyr), which will also have the 'Sample' as an additional column.
library(tidyr)
spread(DF, Detector, Value)
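If installing a package is not an option, the same long-to-wide reshape can be sketched in base R with tapply. This is a swapped-in alternative, not what the answer above uses; DF is retyped from the question, and taking the first value per cell assumes each Sample/Detector pair occurs at most once.

```r
# Long-format data retyped from the question
DF <- data.frame(
  Sample   = c(1, 1, 2, 3, 3, 3),
  Detector = c("A", "B", "A", "A", "B", "C"),
  Value    = c(1, 2, 3, 2, 3, 1)
)

# One cell per Sample/Detector pair; combinations that never occur become NA
wide <- tapply(DF$Value, list(DF$Sample, DF$Detector), function(v) v[1])

# Sample names end up as row names, detectors as column names
result <- as.data.frame(wide)
result
```

This fills each value into the right position by name, so no manual loop or index bookkeeping is needed.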

How to select specific elements and find their index in a data.frame?

I would like to select specific elements of a data.list after processing it.
To show the processing parameters, I describe my problem with the reproducible example below.
In the example code, data.list holds three data sets, each with 5 columns.
Each data set repeats itself three times, and each is assigned a unique number called set_nbr that identifies it.
#to create reproducible data (this part creates three sets of data each one repeats 3 times of those of Mx, My and Mz values along with set_nbr)
set.seed(1)
data.list <- lapply(1:3, function(x) {
nrep <- 3
time <- rep(seq(90,54000,length.out=600),times=nrep)
Mx <- c(replicate(nrep,sort(runif(600,-0.014,0.012),decreasing=TRUE)))
My <- c(replicate(nrep,sort(runif(600,-0.02,0.02),decreasing=TRUE)))
Mz <- c(replicate(nrep,sort(runif(600,-1,1),decreasing=TRUE)))
df <- data.frame(time,Mx,My,Mz,set_nbr=x)
})
After applying some function I have output like this:
result
time Mz set_nbr
1 27810 -1.917835e-03 1
2 28980 -1.344288e-03 1
3 28350 -3.426615e-05 1
4 27900 -9.934413e-04 1
5 25560 -1.016492e-02 2
6 27360 -4.790767e-03 2
7 28080 -7.062256e-04 2
8 26550 -1.171716e-04 2
9 26820 -2.495893e-03 3
10 26550 -7.397865e-03 3
11 26550 -2.574022e-03 3
12 27990 -1.575412e-02 3
My questions start here.
1) How do I get the min, middle and max values of the time column for each set_nbr?
2) How do I use the evaluated set_nbr and Mz values inside data.list?
In short:
After deciding the min, middle and max values of the time column and the corresponding Mz values for each set_nbr in result, I want to go back to the original data.list and extract the Mx, My and Mz columns that match those set_nbr and Mz values. Since each set_nbr actually corresponds to 600 rows, I would like to extract the whole family of rows for the selected set_nbr values from data.list.
We use time as a factor to select set_nbr. Here "factor" means an extraction parameter, not a factor in the R sense.
In addition, as you will see, four set_nbr entries exist for each dataset, but they actually refer to different datasets in data.list.

I'm a big advocate of using lists of data frames when appropriate, but in this case it doesn't look like there's any reason to keep them separated as different list items. Let's combine them into a single data frame.
library(dplyr)
dat = bind_rows(data.list)
Then getting your summary stats is easy:
dat %>% group_by(set_nbr) %>%
summarize(min_time = min(time),
max_time = max(time),
middle_time = median(time))
# Source: local data frame [3 x 4]
#
# set_nbr min_time max_time middle_time
# 1 1 90 54000 27045
# 2 2 90 54000 27045
# 3 3 90 54000 27045
In your sample data, time is defined the same way each time, so of course the min, median, and max are all the same.
I'd suggest, in the new question you ask about plotting, starting with the combined data frame dat.
As to your second question:
2) How to select evaluated set_nbr values inside of data.list?
To select a single item from a list, use double brackets:
data.list[[2]]
However, with the combined data, it's just a normal column of a normal data frame so any of these will work:
dat[dat$set_nbr == 2, ]
subset(dat, set_nbr == 2)
filter(dat, set_nbr == 2)
To your clarification in the comments: if you want the Mx and My values for the time and set_nbr in the result object, using my combined dat above, simply do a join: left_join(result, dat).
This should work, but I'm a little confused because in your simulated data time is numeric, but in your new text you say "we use time as a factor". If you've converted time to a factor object, this will only work if it has the same levels in each of the data frames in your data list. If not, I would recommend keeping time as numeric.
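A minimal base R sketch of that join, using merge instead of left_join and tiny made-up data (the numbers below are hypothetical, not taken from the question). It joins on time and set_nbr only, which is an assumption on my part to avoid matching on the floating-point Mz values.

```r
# Hypothetical stand-in for the combined data frame
dat <- data.frame(
  time    = rep(c(90, 180, 270), times = 2),
  Mx      = c(0.010, 0.000, -0.010, 0.012, 0.002, -0.008),
  My      = c(0.020, 0.010, -0.020, 0.015, 0.005, -0.015),
  Mz      = c(0.9, 0.1, -0.9, 0.8, 0.2, -0.8),
  set_nbr = rep(1:2, each = 3)
)

# Hypothetical stand-in for the processed result: one selected time per set
result <- data.frame(time = c(180, 270), set_nbr = c(1L, 2L))

# Keep every row of result and pull in the matching Mx, My, Mz values
picked <- merge(result, dat, by = c("time", "set_nbr"), all.x = TRUE)
picked
```

In the simulated data from the question each time value appears three times per set, so each row of the processed output would match three rows of dat there.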

R functions that output datasets

I am a bit new to R and am trying to use a function to output a dataframe. I have several dataframes that need deduplication. Each record in the data frame has an index variable (RecID) and a patient ID (PatID). If patients are listed multiple times in the dataframe, I want to choose the record with the largest RecID.
I want to be able to change this data frame:
PatID RecID
1 1
1 2
2 3
3 4
3 5
4 6
Into this dataframe
PatID RecID
1 2
2 3
3 5
4 6
I can use the following code to successfully deduplicate the dataframe.
df <- df[order(df$PatID, -df$RecID),]
df <- df[ !duplicated(df$PatID), ]
I created a function with this code so I can apply my deduplication scheme across multiple data frames easily.
dedupit <- function(x) {
x <- x[order(x$PatID, -x$RecID),]
x <- x[ !duplicated(x$PatID), ]
}
However, when I run dedupit(df), it does not create a new df dataframe with deduplicated records. The function won't output the final dataframe or any of the intermediate dataframes. Is there a way to have functions output dataframes?

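You need to put return(x) at the end of your function, and assign the result when you call it, e.g. df <- dedupit(df). The function works on a copy of its argument, so calling dedupit(df) on its own does not change the df in your workspace.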
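A short sketch of how this could look, using the six example rows from the question. The function body is the asker's own code with an explicit return added.

```r
# Keep, for each PatID, the record with the largest RecID
dedupit <- function(x) {
  x <- x[order(x$PatID, -x$RecID), ]
  x <- x[!duplicated(x$PatID), ]
  return(x)  # hand the deduplicated data frame back to the caller
}

# Example data from the question
df <- data.frame(PatID = c(1, 1, 2, 3, 3, 4), RecID = 1:6)

# The result has to be assigned, otherwise df itself stays unchanged
df <- dedupit(df)
df
```

The same function can then be applied to several data frames, for example with lapply over a list of them.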

Subsetting a DataFrame with a List

I am trying to use a random list of sample IDs generated in R as the basis to subset a larger dataframe. I realize that I can just randomly subset directly but I'd like to know in case next time I need to subset using specific sample names.
I have data (mydata) that looks like this:
Species.Obj.ID Sample Sample.Name
1 23 George
2 90 Adilade
3 34 Frank
4 23 Steve
I generated random numbers
rand<-sample(1:4, 2, replace=F)
Then tried to subset the data using the example from here (How to filter a table's row based on an external list?).
subset(mydata,subset= Species.Obj.ID %in% rand)
But I just get back the full original data (mydata) in return.
I'd like to get the rows in the datatable from the random numbers in rand.
I'd greatly appreciate any advice on how to correct this. Thanks!

This way is a bit easier than using subset:
> yourData <- data.frame(Species.Obj.ID = 1:4,
Sample = c(23,90,34,23),
Sample.Name = c("George", "Adilade", "Frank", "Steve"))
> rand <- sample(1:4, 2, replace = FALSE)
> rand
[1] 2 3 # the random numbers are 2 and 3
> yourData[rand, ] # the rows shown are rows 2 and 3
Species.Obj.ID Sample Sample.Name
2 2 90 Adilade
3 3 34 Frank
You can even shorten this all by using
yourData[sample(1:4, 2, replace = FALSE), ]
This will return a random sample of 2 rows of your original data.
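Row indexing as above picks rows by position, which matches Species.Obj.ID here only because the IDs happen to run 1 to 4 in row order. For the case raised in the question (subsetting by specific IDs or sample names), a hedged sketch with %in% could look like this; wanted_ids and wanted_names are made-up selections for illustration.

```r
mydata <- data.frame(
  Species.Obj.ID = 1:4,
  Sample         = c(23, 90, 34, 23),
  Sample.Name    = c("George", "Adilade", "Frank", "Steve")
)

# Select by ID value rather than by row position
wanted_ids <- c(2, 3)
by_id <- mydata[mydata$Species.Obj.ID %in% wanted_ids, ]

# Select by sample name
wanted_names <- c("Frank", "Steve")
by_name <- mydata[mydata$Sample.Name %in% wanted_names, ]

by_id
by_name
```

Because the match is on values, this keeps working if the rows are reordered or the IDs are not consecutive.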
