I have summarized a column of a data frame (call this DATA) that consists of IDs so I get the total number of each ID in the given column. I'd like to convert this to another data frame (call this TOTALNUM), so I have two columns. The first column is the ID itself and the second column is the total number of each ID. Is this possible?
Sample data:
ids <- c(1,2,3,4,5,1,2,3,1,5,1,4,2,2,2)
info <- c("A","B","C","A","B","C","A","B","C","A","B","C","A","B","C")
DATA <- data.frame(ids, info)
DATA$ids <- as.factor(DATA$ids)
What I would like to put in a data frame:
Top row would be the first column in a new data frame.
Second row would be the second column in a new data frame.
summary(DATA$ids)
This is what I would like the data frame to look like:
ids nums
1 4
2 5
3 2
4 2
5 2
Thanks!!
With your approach, you can take advantage of the fact that summary returns a vector of counts, with names for each value of ids:
> my.summary <- summary(DATA$ids)
> data.frame(ids=names(my.summary), nums=my.summary)
ids nums
1 1 4
2 2 5
3 3 2
4 4 2
5 5 2
Or--and this approach is more straightforward--you can create a frequency table based on ids and then convert that to a data frame:
> as.data.frame(table(ids), responseName="nums")
ids nums
1 1 4
2 2 5
3 3 2
4 4 2
5 5 2
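A self-contained sketch of the frequency-table approach on the sample data. Naming the argument inside table() (ids = DATA$ids) is one way to get a column called ids when the counts come from a data frame column rather than a free-standing vector; TOTALNUM is the name used in the question. table() may also be the safer choice for columns with many distinct IDs, since summary() on a factor lumps levels beyond its maxsum limit (100 by default) into an "(Other)" entry.

```r
# Sample data from the question
ids  <- c(1, 2, 3, 4, 5, 1, 2, 3, 1, 5, 1, 4, 2, 2, 2)
info <- rep(c("A", "B", "C"), times = 5)
DATA <- data.frame(ids, info)
DATA$ids <- as.factor(DATA$ids)

# Count each ID, then turn the table into a two-column data frame.
# Naming the table dimension "ids" sets the name of the first column;
# responseName sets the name of the count column.
TOTALNUM <- as.data.frame(table(ids = DATA$ids), responseName = "nums")
TOTALNUM
```

Note that TOTALNUM$ids is a factor here; as.character() or as.integer(as.character()) converts it if plain values are needed.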
Related
I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values for one person. Now I need to sort by one column in increasing order, and I want the remaining columns to be reordered along with it, so that each row still contains the data of one and the same person.
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
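These are the column names of my data.frame, and I want to sort it by the column called "avg".
First of all, please always provide us with a reproducible example such as the one below. Indexing a data frame with order() reorders whole rows, so all columns stay aligned. The example below sorts by avg in decreasing order; drop the minus sign for increasing order.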
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1
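Since the question asks for increasing order, here is a minimal sketch of that variant. The values are made up for illustration; only the column names come from the question.

```r
# Illustrative values only; the column names are those from the question
BAPlotDET <- data.frame(
  fsskiddet = c(10, 20, 30),
  fspiddet  = c(11, 21, 31),
  avg       = c(2.5, 0.5, 1.5),
  diff      = c(0.1, -0.2, 0.3),
  absdiff   = c(0.1, 0.2, 0.3)
)

# order() returns the row positions that put avg in increasing order;
# using them as the row index moves every column together
sorted <- BAPlotDET[order(BAPlotDET$avg), ]
sorted
```

The original row names travel with the rows, which makes it easy to check that each person's values stayed together.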
In R, I have a large dataframe where the first two columns are the primary ID (object) and a secondary ID (element of the object).
I want to create a subset of this dataframe that keeps only the rows whose combination of primary and secondary ID occurs exactly 20 times in the original dataframe. I also have to repeat this process for other dataframes with the same structure.
Right now, I first count in a new dataframe how many times each pair of values (primary and secondary ID) occurs, and then use a for loop to build the new dataframe. The process is extremely slow and inefficient: the loop writes about 20 rows per second, starting from a dataframe that has between 500,000 and 1 million rows.
for (i in 1:13){
x <- fread(dataframe_list[i]) #list which contains the dataframes that have to be analyzed
x1 <- ddply(x,.(Primary_ID,Secondary_ID), nrow) #creating a dataframe which shows how many times a couple of values repeats itself
x2 <- subset(x1, x1$V1 == 20) #selecting all couples that are repeated for 20 times
for (n in 1:length(x2$Primary_ID)){
x3 <- subset(x, (x$Primary_ID == x2$Primary_ID[n]) & (x$Secondary_ID == x2$Secondary_ID[n]))
outfiles <- paste0("B:/Results/Code_3_", Band[i], ".csv")
fwrite(x3, file=outfiles, append = TRUE, sep = ",")
}
}
How can I take, in one step, all the rows of the original dataframe whose primary and secondary IDs match the ones in the x2 dataframe, instead of writing one set of 20 rows at a time? Maybe this is easier in SQL, but I have to work in R for now.
Edit:
Sure. Let's say I'm starting from a dataframe like this (it has more rows with repeating IDs; I'll stop at 5 rows to keep it short):
Primary ID Secondary ID Variable
1 1 1 0.5729
2 1 2 0.6289
3 1 3 0.3123
4 2 1 0.4569
5 2 2 0.7319
Then with my code I count in a new dataframe the repeated rows (for a threshold value of 4 instead of 20, so I can give you a short example):
Primary ID Secondary ID Count
1 1 1 1
2 1 2 3
3 1 3 4
4 2 1 2
5 2 2 4
The wanted output should be a dataframe like this:
Primary ID Secondary ID Variable
1 1 3 0.5920
2 1 3 0.6289
3 1 3 0.3123
4 1 3 0.4569
5 2 2 0.7319
6 2 2 0.5729
7 2 2 0.6289
8 2 2 0.3123
If anyone is interested, I managed to find a way. After counting with the code above how many times the couple of values is repeated, the output that I wanted can be obtained in this simple way:
#Select all the couples that are repeated 20 times
x2 <- subset(x1, x1$V1 == 20)
#Keep only the primary and secondary ID columns of x2, with their names, so the join can match on them
x3 <- x2[, c("Primary_ID", "Secondary_ID")]
#Wanted output (inner_join() comes from the dplyr package)
dataframe <- inner_join(x, x3, by = c("Primary_ID", "Secondary_ID"))
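For comparison, the count-then-filter step can also be sketched in base R without a join. This is only a sketch under assumptions: it expects a plain data.frame (not a data.table) with columns named Primary_ID and Secondary_ID, and keep_groups_of_size is a hypothetical helper name, not part of any package. The toy data uses a threshold of 2 instead of 20.

```r
# Hypothetical helper: keep only rows whose (Primary_ID, Secondary_ID)
# combination occurs exactly `size` times in df
keep_groups_of_size <- function(df, size) {
  # ave() computes length() within each ID combination and returns the
  # group size once per row, aligned with the rows of df
  n_in_group <- ave(seq_len(nrow(df)), df$Primary_ID, df$Secondary_ID,
                    FUN = length)
  df[n_in_group == size, , drop = FALSE]
}

# Toy data: (1, 3) and (2, 2) occur twice, (1, 1) occurs once
x <- data.frame(
  Primary_ID   = c(1, 1, 1, 2, 2),
  Secondary_ID = c(3, 3, 1, 2, 2),
  Variable     = c(0.57, 0.63, 0.31, 0.46, 0.73)
)
result <- keep_groups_of_size(x, 2)
result
```

The filtered data frame can then be written once per input file, rather than appending 20 rows at a time inside a loop.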
Say I have some data created like this
n <- 3
K <- 4
dat <- expand.grid(var1=1:n, var2=1:K)
dat looks like this:
var1 var2
1 1 1
2 2 1
3 3 1
4 1 2
5 2 2
6 3 2
7 1 3
8 2 3
9 3 3
10 1 4
11 2 4
12 3 4
I want to remove some rows from both data frames in the list at the same time. Let's say I want to remove the 11th row, and I want the 'gap' to be filled in, so that now the 12th row will become the 11th row.
I understand this is a list of two data frames. Thus the advice here does not apply, since dat[[11]]<-NULL would do nothing, while dat[[2]]<-NULL would remove the second data frame from the list
lapply(dat,"[",11) lets me access the relevant elements, but I don't know how to remove them.
Assuming that we want to remove rows from a list of data.frames, we loop over the list elements using lapply and remove the rows using a negative numeric index.
lapply(lst, function(x) x[-11,])
Or without the anonymous function
lapply(lst, `[`, -11,)
The 'dat' is a data.frame.
is.data.frame(dat)
#[1] TRUE
If we want to remove rows from 'dat',
dat[-11,]
If the row.names also needs to be changed
`row.names<-`(dat[-11,], NULL)
data
lst <- list(dat, dat)
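Putting the pieces together, a small runnable sketch that removes row 11 and renumbers the rows, both for the single data frame and for the list. drop_row is a hypothetical helper name introduced here for illustration.

```r
dat <- expand.grid(var1 = 1:3, var2 = 1:4)
lst <- list(dat, dat)

# Hypothetical helper: drop row i and reset the row names so that
# no gap is left in the numbering
drop_row <- function(df, i) {
  out <- df[-i, , drop = FALSE]
  rownames(out) <- NULL
  out
}

dat2 <- drop_row(dat, 11)               # single data frame
lst2 <- lapply(lst, drop_row, i = 11)   # every data frame in the list
dat2
```

After the call, the former 12th row (var1 = 3, var2 = 4) is row 11.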
Right now I have two data frames in R. One contains some data that looks like this:
> data
p a i
1 1 1 2.2561469
2 5 2 0.2316390
3 2 3 0.4867456
4 3 1 0.1511705
5 4 2 0.8838884
And one that contains coefficients that looks like this:
> coef
3 2 1
1 29420.50 31029.75 29941.96
2 26915.00 27881.00 27050.00
3 27756.00 28904.00 28699.40
4 28345.33 29802.33 28377.56
5 28217.00 29409.00 28738.67
These data frames are connected as each value in data$a corresponds to a column name in coef and data$p corresponds to row names in coef.
I need to multiply these coefficients by the values in data$i, matching the row and column names in coef to data$p and data$a.
In other words, for each row in data, I need to use data$a and data$p to pull a specific number from coef, which is then multiplied by the value of data$i for that row to create a new column in data that looks something like this:
> data
p a i z
1 1 1 2.2561469 67553
2 5 2 0.2316390 6812
3 2 3 0.4867456 .
4 3 1 0.1511705 .
5 4 2 0.8838884 .
I was thinking I should create factors in my coef data frame based on the row and column names but am unsure of where to go from there.
Thanks in advance,
Ian
If you reorder the columns of your coef data.frame by name, so that column k is the one named k, you can index it by position as though the column names weren't there.
coef <- coef[,order(names(coef))]
Then apply a function to each row:
myfun <- function(x) {
x[3]*coef[x[1], x[2]]
}
data$z <- apply(data, 1, myfun)
> data
p a i z
1 1 1 2.2561469 67553.460
2 5 2 0.2316390 6812.271
3 2 3 0.4867456 13100.758
4 3 1 0.1511705 4338.503
5 4 2 0.8838884 26341.934
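A possible alternative that does not depend on the column order is to look the coefficients up by name with a two-column character index matrix. This is a sketch, not the approach from the answer above; it assumes the row names of coef are "1" to "5" and its column names are "3", "2", "1", as printed in the question.

```r
data <- data.frame(
  p = c(1, 5, 2, 3, 4),
  a = c(1, 2, 3, 1, 2),
  i = c(2.2561469, 0.2316390, 0.4867456, 0.1511705, 0.8838884)
)
coef <- data.frame(
  `3` = c(29420.50, 26915.00, 27756.00, 28345.33, 28217.00),
  `2` = c(31029.75, 27881.00, 28904.00, 29802.33, 29409.00),
  `1` = c(29941.96, 27050.00, 28699.40, 28377.56, 28738.67),
  check.names = FALSE
)

# Convert to a matrix, keeping the row names "1".."5" as dimnames
coef_mat <- as.matrix(coef, rownames.force = TRUE)

# Each row of the index matrix is a (row name, column name) pair,
# so the lookup is done for all rows of data at once
idx <- cbind(as.character(data$p), as.character(data$a))
data$z <- coef_mat[idx] * data$i
data
```

Because the lookup is by name, the columns of coef can stay in their original order.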
I would like to rename a large number of columns (column headers) to have numerical names rather than combined letter+number names. Because of the way the data is stored in raw format, I cannot just access the correct column numbers by using data[[152]] if I want to interact with a specific column of data (because random questions are filtered completely out of the data due to being long answer comments), but I'd like to be able to access them by data$152. Additionally, approximately half the columns names in my data have loaded with class(data$152) = NULL but class(data[[152]]) = integer (and if I rename the data[[152]] file it appropriately allows me to see class(data$152) as integer).
Thus, is there a way to use the loop iteration number as a column name (something like below)
for (n in 1:415) {
names(data)[n] <-"n" # name nth column after number 'n'
}
That will reassign all my column headers and ensure that I do not run into question classes resulting in null?
As additional background info, my data is imported from a comma delimited .csv file with the value 99 assigned to answers of NA with the first row being the column names/headers
data <- read.table("rawdata.csv", header=TRUE, sep=",", na.strings = "99")
There are 415 columns with headers in format Q001, Q002, etc
There are approximately 200 rows with no row labels/no label column
You can do this without a loop, as follows:
names(data) <- 1:415
Let me illustrate with an example:
dat <- data.frame(a=1:4, b=2:5, c=3:6, d=4:7)
dat
a b c d
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
Now rename the columns:
names(dat) <- 1:4
dat
1 2 3 4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
EDIT : How to access your new data
@Ramnath points out very accurately that you won't be able to access your data using dat$1:
dat$1
Error: unexpected numeric constant in "dat$1"
Instead, you will have to wrap the column names in backticks:
dat$`1`
[1] 1 2 3 4
Alternatively, you can use a combination of character and numeric data to rename your columns. This could be a much more convenient way of dealing with your problem:
names(dat) <- paste("x", 1:4, sep="")
dat
x1 x2 x3 x4
1 1 2 3 4
2 2 3 4 5
3 3 4 5 6
4 4 5 6 7
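Because some questions are missing from the data, numbering the columns 1:415 by position would not line up with the question numbers. One way to keep that correspondence might be to derive the new names from the existing headers instead. The sketch below assumes headers of the form Q001, Q002, etc., as described in the question; the small data frame is made up for illustration.

```r
# Made-up example with a gap in the question numbers
dat <- data.frame(Q001 = 1:3, Q002 = 4:6, Q152 = 7:9)

# Strip the leading "Q" and the zero padding: "Q152" becomes "152"
names(dat) <- as.character(as.integer(sub("^Q", "", names(dat))))

names(dat)
dat[["152"]]   # access by question number, as a character name
dat$`152`      # the same column, using backticks
```

Note that dat[[152]] with a number would still index by position, so the name has to be passed as a character string.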