I have basic knowledge of R, but I'm not sure how to go about the following programming:
I have a large data frame of data from 25 participants. Each participant has a different total number of sound files (e.g., #1 might have 105 and #2 might have 98). For each participant, we've coded whether or not they're in a specific location (if they're in their apartment, 1; if not, 0).
I want to get the count of how many times Apartment=1 for each participant. This is where I'm confused: my Excel sheet is organized such that all of the participants are stacked on top of each other. So the first 105 rows are for Participant #1, and the next 98 rows are for Participant #2.
Do I need to aggregate my data by participant number, and then get the count?
I've never written a for loop, but is that another possible solution?
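A grouped sum avoids the for loop entirely: because the rows are stacked, grouping by the participant column handles the different row counts per participant automatically. A minimal sketch, assuming the data frame is called df with columns Participant and Apartment (both names are placeholders for your actual columns):

library(dplyr)

# sum the 0/1 indicator within each participant to count the Apartment == 1 rows
df %>%
  group_by(Participant) %>%
  summarise(apartment_count = sum(Apartment, na.rm = TRUE))

# base-R equivalent of the same aggregation, no packages needed
aggregate(Apartment ~ Participant, data = df, FUN = sum)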
Basically I'm trying to sort through a large dataset where the first three digits of each number correspond to different text labels. Before I can filter through, I'm trying to assign values to the different strings:
Crops 101
Fishing 102
Livestock 103
Movies, TV, & Stage 201
In the larger dataset there are hundreds of numbers such as 1018347, where the first three digits correspond to crops and each value includes the number of times it appeared. The digits after that specify the type of crop, but for the purpose of my work I need to sort through the entire thing by the first three digits and sum the amount for each category. I'm fairly new to R and wasn't able to find a sufficient answer, so any help would be appreciated.
I'm not sure I'm reading your question correctly, but it seems you are looking for a way to first create a new variable based on the first three digits, and then summarize the results as a total per category.
What could work is:
library(dplyr)

data %>%
  mutate(first_part = substr(variable, 1, 3)) %>%  # keep the first three digits
  group_by(first_part) %>%
  summarize(occurrences = n())
The code above counts how many times each "first_part" (the first three digits) occurs. The same approach can be reproduced for the second part, or for both together.
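As a quick illustration on made-up codes (the values below are placeholders, not from the real dataset):

# toy data: two crop codes (101...) and one fishing code (102...)
data <- data.frame(variable = c("1018347", "1015523", "1029981"))
data %>%
  mutate(first_part = substr(variable, 1, 3)) %>%
  group_by(first_part) %>%
  summarize(occurrences = n())
# returns first_part "101" with 2 occurrences and "102" with 1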
I have a dataset with people's IDs (some people (IDs) have multiple rows) and risk scores (recorded as a number between 1 and 7, with some NAs), and I wanted to count the number of people in each risk group without counting the same person twice.
When creating a subset containing only 1 row/person, I obtain a certain number of people for each group. However, when I use this function (for each risk group):
length(unique(data$person_id[data$RISK == 1]))
it seems like I get one extra person in each risk group (so 7 extra people in total).
Does someone have an explanation for this? Do I have to do -1 each time I use this function?
Thanks in advance!
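A likely explanation, given the NAs you mention: in R, subsetting with a logical condition that evaluates to NA returns an NA element rather than dropping that row, so each risk group picks up one extra "person" whose ID is NA, and unique() counts it. Removing the NAs first avoids the off-by-one, with no need for a -1 correction:

# which() keeps only the positions where the condition is TRUE, dropping NAs
length(unique(data$person_id[which(data$RISK == 1)]))

# equivalent: make the comparison NA-safe explicitly
length(unique(data$person_id[!is.na(data$RISK) & data$RISK == 1]))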
Unfortunately, I'm totally lost with R. I grew up with SPSS, which I really regret right now.
My problem:
I have several Excel files
In each file I have several rows which belong to one specific participant, and each row consists of several columns. E.g., several rows with several columns for participant A, several rows with several columns for participant B, and so on
My goal is to have just one row for each participant with all the data in columns
This means I need code which moves participant A's second row to the end of their first row
Afterwards the next row of participant A needs to be moved to the end of the first row, iterating until there is only one row for participant A with all the data in columns.
Then I need to do the same for participants B, C, and so on
Is there a way to do this with R? I'm so lost.
Best,
Jonas
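This long-to-wide reshape is straightforward once each participant's rows are numbered. A sketch using tidyr, assuming each Excel file has been read into a data frame df (e.g. with readxl::read_excel()) with a participant column and measurement columns v1 and v2 — all of these names are placeholders for your actual columns:

library(dplyr)
library(tidyr)

df %>%
  group_by(participant) %>%
  mutate(row = row_number()) %>%          # 1, 2, 3, ... within each participant
  ungroup() %>%
  pivot_wider(names_from = row,
              values_from = c(v1, v2))    # yields columns v1_1, v2_1, v1_2, v2_2, ...

The result has one row per participant, with each original row's values spread across suffixed columns.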
I have a complex dataframe (orig_df). Of the 25 columns, 5 are descriptions and characteristics that I wish to use as grouping criteria. The remainder are time series. There are tens of thousands of rows.
I noted in initial analysis and numerical summaries that there are significant issues with outlier observations within some of the specific grouping criteria. I used group_by() and looked at the quintile results within those groups. I would like to eliminate the low and high individual-observation outliers relative to the group-based quintiles to improve the decision tree and clustering analytics. I also want to keep the outliers to analyze separately for the root cause.
How do I manipulate the dataframe so that the individual observations are compared to the group-based quintile results and the split is saved (orig_df becomes ideal_df and outlier_df)?
After identifying the outliers using the link Nikos Tavoularis shared above, you can use ifelse() to create a new variable that flags which records are outliers and which are not. This way you keep all the data, but you can use the new variable to split the records out whenever you want.
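A sketch of that idea, with hypothetical names (grp standing in for your five grouping columns, value for one of the time-series columns) and the 20th/80th percentiles as cut-offs:

library(dplyr)

flagged <- orig_df %>%
  group_by(grp) %>%
  mutate(outlier = ifelse(value < quantile(value, 0.20, na.rm = TRUE) |
                          value > quantile(value, 0.80, na.rm = TRUE), 1, 0)) %>%
  ungroup()

ideal_df   <- filter(flagged, outlier == 0)  # cleaned data for the tree/cluster models
outlier_df <- filter(flagged, outlier == 1)  # set aside for root-cause analysis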
I think this will be relatively elementary, but I cannot for the life of me figure it out.
Imagine a dataset in which there are 108 rows, made up of two readings for each of 54 clones. Essentially, I need to condense the dataset by clone (column 2), averaging the cells in columns 6:653 while keeping the information in columns 1, 2, 3, and 654 (which is identical between the two readings).
I have a pretty small dataset: 108 rows and 654 columns, which I would like to whittle down. The rows consist of 54 different tree clones (column 2), each with two readings (column 4) (54 * 2 = 108). I would like to average the two readings for each clone, reducing my dataset to 54 rows. Just FYI, the first 5 columns are characters and the next 648 are numeric. I would like to remove columns 4 and 5 from the new dataset, leaving a dataset of 54 x 652, but this is optional.
I believe a (plyr) function or something similar will do the trick, but I can't make it work. I've tried a bunch of things, but it just won't play ball.
Thanks in advance.
For the average you can use mean(); for leaving out a row or column, just use negative indexing.
Example:
table[-x, ]   # deletes row x
table[, -x]   # deletes column x
# x can be a single number, or a vector such as x <- c(1:3) for the first three rows/columns
If you provide more information I think others will also help.
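For the grouped averaging itself, here is a base-R sketch using the column positions from the question (assuming the clone ID in column 2 can serve as the grouping key):

# average the numeric readings (columns 6:653) within each clone (column 2)
avg <- aggregate(df[, 6:653], by = list(clone = df[[2]]), FUN = mean)

# keep one copy per clone of the identifying columns 1, 2, 3, and 654
info <- df[!duplicated(df[[2]]), c(1, 2, 3, 654)]

# join the two back together on the clone column: 54 rows x 652 columns
result <- merge(info, avg, by.x = names(df)[2], by.y = "clone")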