Use Cases – The TRUMAN Project

Use Case 1: Households’ daily lives and lifestyles

Market and scientific research on people’s lives has mostly been based on the administration of questionnaires. This seriously limits the possibility of capturing the complex process of relationships, division of duties, roles, and behaviour between the subjects in time and social and physical space. Conducting market and scientific research on people’s lives ‘observing in nature’ means focusing on understanding real-world behaviour and interactions without the influence of controlled environments or self-reported data. By observing “in the wild,” researchers focus on naturalistic observation and can provide more authentic insights into how household members live, make decisions, and engage with their social and physical environment. In a nutshell, these data can be used as input to develop new models, including AI models, covering a wide range of social topics, e.g. to help improve the lives of households and individuals, to inform public stakeholders to develop public policies and private companies to develop new business projects.
Furthermore, after data collection is complete, another critical issue is related to data pre-processing. Here, the main issues concern the quality of the data and the presence of anomalies and distortions (noisy, missing, and outliers) in the collected data, generated either by the participants (e.g. memory errors, typing errors, etc.) or by the technology used (e.g. app operation, sensor errors, OS, etc.).
Of the four scenarios proposed by TRUMAN, this use case aims to implement scenario 3 and to explore some of the key issues driving this project proposal related to data collection and low data quality that negatively impacts both standard statistical analysis and artificial intelligence models. As it is well known in social methodological research, only economic incentives reduce dropout and increase the cooperation of participants [48, 49]. Therefore, to increase the cooperation of household members, each household will be remunerated.

Use Case 2: In-browser visual-based phishing detection

Phishing represents the topmost form of cybercrime, and its proliferation is constantly increasing, essentially with phishing websites among the most common adversarial vectors. After deploying their phishing hooks in the wild, attackers try to lure their victims to such malicious web pages – to steal their private data or compromise their IT systems. Countermeasures are: (1) human-centered, which aims to improve the ability of humans to avoid traps; (2) machine-centered (e.g., phishing website detectors PWD), which aims to prevent the human user from landing on a phishing trap in the first place. Such machine-centered solutions – typically deployed on the cloud or on-premise for big enterprises- entail the analysis of a larger number of potentially malicious URLs and the application of ML to detect phishing websites by leveraging features extracted from the URL, domain, and website content, as well as comparing their visual similarity with the legitimate webpages of well-known brands. However, malicious actors continuously refine their strategies to deploy malicious pages that evade even production-grade PWDs. The most common form of evasion – known as cloaking – consists of masking the malicious content to any automated system intent on analysing the page and only showing it to the browsers whose usage and characteristics hint at the presence of a real human behind them. Although some solutions exist to mask the automated nature of web scanners, evasion is still possible by requiring the user to perform specific actions or by simply showing the malicious content to agents connecting from specific IPs in the regions the phishing campaign is intended to target. Knowing upfront this detail is challenging and does not scale to protect users worldwide.
Our key idea is to change the detection perspective to match the one of the users. We will develop a PWD that detects brand impersonation by using the visual information collected inside the users’ browsers and other features extracted from the URL without the need for suspicious feeds in input or automated scanners. The ultimate goal of our use case will be to instantiate Scenario 2 among the four proposed by TRUMAN. The development of our solution will undergo a first phase in which the data pre-processing and the model training are defined, evaluated, and improved on an entity (i.e., out company infrastructure) that differs from the data collector and the model executors (i.e., end-user machines).

Use Case 3: Detecting Branch Staff Abuse with Federated Learning

This use case aims to detect fraudulent transactions carried out by bank employees without the knowledge of customers. The basis of the use case is the detection of such transactions and the protection of both the customer and the bank. It is important to take precautions in advance to avoid being victimised. The bank’s data will be used in the project after being anonymised, while developments are carried out in the cloud; These data will be modeled using various technologies in a graph-based manner. During this modeling, the data will be divided according to the branches of the bank, and each branch will be used as a separate node. From each graph, the resulting information will be sent anonymously to the main node, and the information of the branches will remain anonymous. On the master node, it will be decided whether a transaction is fraudulent or not.
In the project, we will create a reliable system since the data of the bank branches are not seen by the main node. Our use case will mainly focus on trustworthiness and distributed computing. We have prior know-how regarding the collection and processing of data regarding the project. We will be able to act more dynamically. In Phase 1, we will decide how the data we have should be modeled and the suitability of the AI methods we have used previously. In Phase 2, we will complete our developments with the technologies we have decided on. In Phase 3, we will ensure its integration into suitable use cases.

Use Case 4: LLM-powered chatbot for better disease management

The PDMonitor Ecosystem, developed by PD Neurotechnology, is designed to assist Parkinson’s disease patients in managing both motor and non-motor symptoms. The ecosystem includes a mobile application that functions as a digital diary for patients to log symptoms, medication, and nutrition throughout the day. Additionally, a chatbot, powered by a large language model (LLM), is being developed to integrate with the app, providing continuous support through conversation, personalised health tracking, disease management coaching, and cognitive assessments. The primary goal is to empower patients, improve patient-physician communication, and enhance the management of Parkinson’s disease. Given the sensitive nature of health data, the security and privacy of the AI tools are of paramount concern, with a specific focus on potential vulnerabilities such as prompt hacking and jailbreaking attacks. In case of prompt injection attacks, malicious users may exploit the chatbot by using crafted prompts to manipulate its responses, inducing the chatbot to provide inaccurate medical advice, potentially leading to harmful health outcomes for patients (medically-false information), provoking the chatbot to generate aggressive or offensive language towards patients or caregivers (hate Speech), and extracting sensitive medical or personal information, risking privacy breaches and unauthorised access (data leakage). In case of jailbreaking attacks, malicious users attempt to bypass the chatbot’s built-in restrictions designed to prevent misuse, manipulating the chatbot to ignore safety protocols and discuss prohibited topics or provide unethical advice (bypassing content restrictions), and using the chatbot for unintended purposes, such as promoting harmful behaviour or sharing sensitive operational information (inappropriate use).
The PDMonitor Ecosystem deals with various types of sensitive data through retrieval-augmented generation (RAG) techniques. Ensuring data quality is paramount, as inaccurate or incomplete patient data could compromise the chatbot’s robustness in maintaining safe, reliable, and appropriate interactions. Missing data or noisy/outlier data, such as inconsistent symptom tracking or inaccurate self-reports, can degrade model performance. Moreover, data bias in the patient dataset may lead to unfair treatment recommendations, as the model could favor certain demographics over others, raising concerns about fairness. Given the sensitive nature of health data, the security and privacy of the AI tools are of paramount concern. In this ecosystem, the data owners (Parkinson’s disease patients), who interact with the chatbot by logging their symptoms, medications, and other personal health information, could be considered adversarial and may attempt to exploit vulnerabilities in the chatbot (Scenario 4).
This use case aims to explore the robustness in terms of fairness and reliability as well as the security and privacy risks involved in using an LLM-powered chatbot, the PDNeuroBot, that has access to both personal data and medical knowledge. The handling of personal data poses significant risks of data leaks, while the access to medical knowledge could be exploited to provide inaccurate medical advices, potentially leading to harmful health outcomes for patients.