working with data in tables in R - r

I'm a newbie at working with R. I've got some data with multiple observations (i.e., rows) per subject. Each subject has a unique identifier (ID) and has another variable of interest (X) which is constant across each observation. The number of observations per subject differs.
The data might look like this:
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I'd like to find some code that would:
a) Identify the number of observations per subject
b) Identify subjects with at least a certain number of observations (e.g., >= 15 observations)
c) For subjects with at least that many observations, I'd like to manipulate the X value for each observation (e.g., I might want to subtract 1 from their X value, so I'd like to modify X for each observation to be X-1)
I might want to identify subjects with at least three observations and reduce their X value by 1. In the above, individuals #1 and #3 (ID) have at least three observations, and their X values--which are constant across all observations--are 3 and 8, respectively. I want to find code that would identify individuals #1 and #3 and then let me recode all of their X values into a different variable. Maybe I just want to subtract 1 from each X value. In that case, the code would then give me X values of (3-1=)2 for #1 and 7 for #3, but #2 would remain at X = 4.
Any suggestions appreciated, thanks!

You can use the aggregate function to do this.
a) Say your table is named temp. You can count the observations for each ID and X combination by using the length function in aggregate (sum would add up the observation numbers instead of counting the rows):
tot = aggregate(Observation ~ ID + X, temp, FUN = length)
The output will look like this:
ID X Observation
1 1 3 4
2 2 4 2
3 3 8 3
b) To see the IDs that have at least a certain number of observations, you can subset the table, tot.
vals = tot$ID[tot$Observation >= 3]
Output is:
[1] 1 3
c) To change the values for the IDs found in (b), select the matching rows by ID and update them.
tot$X[tot$ID %in% vals] = tot$X[tot$ID %in% vals] - 1
The final output for the table is
ID X Observation
1 1 2 4
2 2 4 2
3 3 7 3
To change the original table, you can subset the table by the IDs you found
temp$X[temp$ID %in% vals] = temp$X[temp$ID %in% vals] - 1
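For comparison, here is a minimal base R sketch of steps a) to c) on the example data from the question. The names min_obs, keep_ids, n_obs and X_new are made up for illustration, and the threshold of 3 is just the one from the question's example.

```r
# Example data from the question (temp is the name used in the answer above)
temp <- data.frame(
  ID          = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
  Observation = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
  X           = c(3, 3, 3, 3, 4, 4, 8, 8, 8)
)

# a) number of observations per subject
counts <- table(temp$ID)

# b) subjects with at least min_obs observations (min_obs is a hypothetical name)
min_obs <- 3
keep_ids <- names(counts)[counts >= min_obs]

# c) attach the group size to every row, then write the recoded value
#    into a new column so the original X is left untouched
temp$n_obs <- ave(temp$Observation, temp$ID, FUN = length)
temp$X_new <- ifelse(temp$n_obs >= min_obs, temp$X - 1, temp$X)
```

Writing the result to X_new rather than overwriting X matches the question's wish to recode into a different variable; assign to temp$X instead if you want to change the column in place.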

a) Identify the number of observations per subject
you can use this code on each variable:
summary

Related

if i want to sort a column by size in rstudio, how do i make sure that the associated values of the rows sort with the column?

I have a data.frame with 1200 rows and 5 columns, where each row contains 5 values for one person. I need to sort one column by size, but I want the remaining columns to be reordered along with it, so that the column is in increasing order and each row still contains the data of one and the same person.
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
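These are the column names of my data.frame, and I want to sort it by the column called "avg".
First of all, please always provide a reproducible example such as the one below. Indexing the rows of a data frame with order() moves whole rows, so all columns stay aligned. The example sorts by decreasing avg because of the minus sign; drop it to sort in increasing order.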
vector <- 1:3
BAPlotDET <- data.frame(vector, vector, vector, vector, vector)
colnames(BAPlotDET) = c("fsskiddet", "fspiddet","avg", "diff","absdiff")
fsskiddet fspiddet avg diff absdiff
1 1 1 1 1 1
2 2 2 2 2 2
3 3 3 3 3 3
BAPlotDET <- BAPlotDET[order(-BAPlotDET$avg),]
> BAPlotDET
fsskiddet fspiddet avg diff absdiff
3 3 3 3 3 3
2 2 2 2 2 2
1 1 1 1 1 1
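Since every column in the example above holds the same values, it is hard to see that rows stay together. A small sketch with made-up data (the data frame people and its values are hypothetical, only the column names come from the question), sorted in increasing order as the question asks:

```r
# Hypothetical data: three people, values chosen so the row order visibly changes
people <- data.frame(
  fsskiddet = c(10, 20, 30),
  fspiddet  = c(11, 21, 31),
  avg       = c(2.5, 0.5, 1.5)
)

# order() returns the row positions that put avg in increasing order;
# using them as a row index moves each person's whole row
sorted <- people[order(people$avg), ]
```

The row names of sorted still show the original positions (2, 3, 1), which is a quick way to confirm that each row was moved as a unit.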

Calculate agreement for a specific row in a table in R

Hello! I am new to R and I have this table. I want to find the correlation, i.e. how much agreement there is in the 3rd row between all three tosses. How can I calculate this for more than two values in a specific row? (Heads is 1, tails is 2.) Can I check for agreement between one column and all the rest?
library(readxl)
> COIN_TOSS <- read_excel("C:/Users/user/Desktop/COIN TOSS.xlsx")
TOSS #1 TOSS #2 TOSS #3
1 2 2 2
2 2 2 1
3 1 1 2
4 2 1 1
5 2 1 1
6 2 1 2
7 1 1 2
8 1 1 2
9 1 1 1
10 2 1 1
Also, I want to print a plot with the sum of values. I have the top 3 values of each column (10 columns in total) computed like this (these are the most frequent values for AM):
am <- excel__data$AM
oneam <- sort(table(am),decreasing=TRUE)[1:3]
> oneam
3 2 4
31 26 24
For the plot I used this, but the y-axis stays the same with the max value being 30, and not all values (stacked up) are visible. How can I change it to go up to 200? Can I use something else besides plot and points?
plot(oneam, pch=10, col='red')
points(onecm, pch=10,col='blue')
points(onefm, pch=10,col='green')
points(onekk, pch=10,col='yellow')
points(onekm, pch=10,col='black')
points(onels, pch=10,col='orange')
"Agreement" and "Correlation" are very different things.
If you simply want to look at "agreement", you could calculate a row-wise mean and standard deviation. A low standard deviation indicates that all tosses in the row are close to each other. If you want to, you could even standardize by dividing the SD by the mean to get the coefficient of variation, a percentage metric.
You can even be more specific and calculate a "distance measure" from one specific toss to the other two e.g.:
library(dplyr)
COIN_TOSS %>%
  # the column names contain spaces and "#", so they need backticks
  mutate(Toss3_Delta = ((`TOSS #1` + `TOSS #2`) / 2 - `TOSS #3`) / ((`TOSS #1` + `TOSS #2`) / 2))
Now, if we are talking about correlation, in your example this only works column-wise, because the three values in a single row are not enough to calculate a correlation.
This works, for example, for the first two tosses:
library(magrittr)
COIN_TOSS %$%
  cor(`TOSS #1`, `TOSS #2`)
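Because the tosses are categories (1 = heads, 2 = tails), another way to read "agreement" is simply the share of matching values. The sketch below is one possible interpretation, not a standard statistic; it retypes the table from the question with the hypothetical column names toss1 to toss3 so that it runs without the Excel file.

```r
# Data retyped from the question (1 = heads, 2 = tails); column names are simplified
tosses <- data.frame(
  toss1 = c(2, 2, 1, 2, 2, 2, 1, 1, 1, 2),
  toss2 = c(2, 2, 1, 1, 1, 1, 1, 1, 1, 1),
  toss3 = c(2, 1, 2, 1, 1, 2, 2, 2, 1, 1)
)

# Row-wise agreement: share of tosses in a row that equal the row's most common value
agree <- apply(tosses, 1, function(r) max(table(r)) / length(r))
agree[3]  # third row is 1, 1, 2, so two of three tosses agree

# One column against each of the others: share of rows where the values match
vs_toss3 <- sapply(tosses[c("toss1", "toss2")],
                   function(col) mean(col == tosses$toss3))
```

With this data, vs_toss3 comes out as 0.3 for toss1 and 0.5 for toss2, i.e. the third toss matches the second more often than the first.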

Take rows with a specific number of repeated values

In R, I have a large dataframe where the first two columns are the primary ID (object) and a secondary ID (element of the object).
I want to create a subset of this dataframe, keeping only the rows whose combination of primary and secondary ID appears exactly 20 times in the original dataframe. I also have to repeat this process for other dataframes with the same structure.
Right now, I first count in a new dataframe how many times each pair of values (primary and secondary ID) occurs and then use a for loop to create the new dataframe, but the process is extremely slow and inefficient: the loop writes 20 rows/second, starting from a dataframe that has from 500,000 to 1 million rows.
for (i in 1:13){
  x <- fread(dataframe_list[i]) #list which contains the dataframes that have to be analyzed
  x1 <- ddply(x,.(Primary_ID,Secondary_ID), nrow) #creating a dataframe which shows how many times a couple of values repeats itself
  x2 <- subset(x1, x1$V1 == 20) #selecting all couples that are repeated for 20 times
  for (n in 1:length(x2$Primary_ID)){
    x3 <- subset(x, (x$Primary_ID == x2$Primary_ID[n]) & (x$Secondary_ID == x2$Secondary_ID[n]))
    outfiles <- paste0("B:/Results/Code_3_", Band[i], ".csv")
    fwrite(x3, file=outfiles, append = TRUE, sep = ",")
  }
}
How can I take all the rows of the original dataframe whose primary and secondary IDs match the ones in the x2 dataframe at once, instead of writing one set of 20 rows at a time? Maybe this is easier in SQL, but I have to deal with R for now.
Edit:
Sure. Let's say I'm starting from a dataframe like this (with other rows with repeating IDs, I'll just stop to 5 rows to be short):
Primary ID Secondary ID Variable
1 1 1 0.5729
2 1 2 0.6289
3 1 3 0.3123
4 2 1 0.4569
5 2 2 0.7319
Then with my code I count in a new dataframe the repeated rows (for a threshold value of 4 instead of 20, so I can give you a short example):
Primary ID Secondary ID Count
1 1 1 1
2 1 2 3
3 1 3 4
4 2 1 2
5 2 2 4
The wanted output should be a dataframe like this:
Primary ID Secondary ID Variable
1 1 3 0.5920
2 1 3 0.6289
3 1 3 0.3123
4 1 3 0.4569
5 2 2 0.7319
6 2 2 0.5729
7 2 2 0.6289
8 2 2 0.3123
If anyone is interested, I managed to find a way. After counting with the code above how many times the couple of values is repeated, the output that I wanted can be obtained in this simple way:
library(dplyr)
#Select all the couples that are repeated 20 times
x2 <- subset(x1, x1$V1 == 20)
#Create a dataframe which contains the repeated primary and secondary IDs from x2
x3 <- x2[, c("Primary_ID", "Secondary_ID")]
#Wanted output: keep only the rows of x whose ID pair appears in x3
dataframe <- inner_join(x, x3, by = c("Primary_ID", "Secondary_ID"))
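The counting and filtering can also be done in a single pass without a join. This is only a sketch in base R with made-up data; the values, the target of 3 (instead of 20) and the names target, group_size and result are all hypothetical.

```r
# Hypothetical data: only the pairs (1, 3) and (2, 2) occur exactly three times
x <- data.frame(
  Primary_ID   = c(1, 1, 1, 1, 1, 2, 2, 2),
  Secondary_ID = c(1, 3, 3, 3, 2, 2, 2, 2),
  Variable     = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
)

target <- 3  # would be 20 in the question

# For every row, the number of rows that share its (Primary_ID, Secondary_ID) pair
group_size <- ave(seq_len(nrow(x)), x$Primary_ID, x$Secondary_ID, FUN = length)

# Keep all rows of the qualifying pairs at once
result <- x[group_size == target, ]
```

The filtered rows can then be written with a single fwrite(result, ...) call per input file instead of appending 20 rows at a time.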

sequential counting with input from more than one variable in r

I want to create a column with sequential values but it gets its value from input from two other columns in the df. I want the value to sequentially count if either Team changes (between 1 and 2) or Event = x. Any help would be appreciated! See example below:
Team Event Value
1 1 a 1
2 1 a 1
3 2 a 2
4 2 x 3
5 2 a 3
6 1 a 4
7 1 x 5
8 1 a 5
9 2 x 6
10 2 a 6
This will do it...
df$Value <- cumsum(df$Event=="x" | c(1, diff(df$Team))!=0)
It takes the cumulative sum (i.e. of TRUE values) of those elements where either Event=="x" or the difference in successive values of Team is non-zero. An extra element is added at the start of the diff term to keep it the same length as the original.
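To see the pieces separately, here is the same idea spelled out on the example data from the question. The intermediate names team_changed and is_x are only for illustration.

```r
# Example data from the question, without the Value column
df <- data.frame(
  Team  = c(1, 1, 2, 2, 2, 1, 1, 1, 2, 2),
  Event = c("a", "a", "a", "x", "a", "a", "x", "a", "x", "a")
)

# TRUE where Team differs from the previous row; the first row always starts a group
team_changed <- c(TRUE, diff(df$Team) != 0)

# TRUE where the event is "x"
is_x <- df$Event == "x"

# Each TRUE starts a new group, so the running total is the group number
df$Value <- cumsum(team_changed | is_x)
df$Value
# 1 1 2 3 3 4 5 5 6 6
```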

group and label rows in data frame by numeric in R

I need to group and label every x observations (rows) in a dataset in R.
I also need to know whether the last group of rows in the dataset has fewer than x observations.
For example, with a dataset of 10 observations and 2 variables, I want to group every 3 rows and add a new column so that the dataset looks like this:
speed dist newcol
4 2 1
4 10 1
7 4 1
7 22 2
8 16 2
9 10 2
10 18 3
10 26 3
10 34 3
11 17 4
df$group <- rep(1:(nrow(df)/3), each = 3)
This works if the number of rows is an exact multiple of 3. Every three rows will get tagged in serial numbers.
A quick and dirty way to find out how incomplete the final group is: check the remainder when nrow is divided by the group size (modulo).
nrow(df) %% 3 #change the divisor to your group size
Assuming your data is df, you can do
df$newcol = rep(1:ceiling(nrow(df)/3), each = 3)[1:nrow(df)]
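An equivalent sketch that avoids rep() entirely and also answers the question about the last group. It uses the ten rows from the question; group_size, sizes, leftover and last_is_short are hypothetical names.

```r
# The ten rows from the question
df <- data.frame(
  speed = c(4, 4, 7, 7, 8, 9, 10, 10, 10, 11),
  dist  = c(2, 10, 4, 22, 16, 10, 18, 26, 34, 17)
)

group_size <- 3

# Row 1..3 -> 1, row 4..6 -> 2, and so on; works for any number of rows
df$newcol <- ceiling(seq_len(nrow(df)) / group_size)

# Number of rows in each group; the last entry is the size of the final group
sizes <- table(df$newcol)

# A non-zero remainder means the last group is incomplete
leftover <- nrow(df) %% group_size
last_is_short <- leftover != 0
```

Here the last group has only one row, so last_is_short is TRUE and leftover is 1.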
